
MySQL Orchestrator Failover Behavior During Replication Lag

保持热爱奔赴山海
Published 2025-09-18 16:49:20
Included in the column: 数据库相关

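In a replication environment, the MySQL Orchestrator tool makes managing a cluster of MySQL servers very efficient. It ensures a smooth transition during any ad hoc failover or planned/graceful switchover.

A few configuration parameters play a crucial role in controlling and influencing failover behavior. In this blog post, we explore some of these key options and how they affect the overall failover process.

Let's go through these settings one by one with some examples.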

FailMasterPromotionIfSQLThreadNotUpToDate

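This option is disabled by default. When it is set to "true", a master failover or promotion is aborted if the candidate master has not yet consumed all of its relay log events.

If this setting stays "false", then when all replicas are lagging and the current master goes down, one of the replicas is still elected as the new master, which can result in data loss on the new master. Later, when the old master is added back as a replica, this can lead to duplicate-entry errors.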
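The effect of this flag can be summarised in a few lines. The following is a minimal Python sketch, not Orchestrator's actual code: the function names, the returned labels, and the `Exec_Master_Log_Pos` value are assumptions made for illustration. It treats the SQL thread as up to date when the coordinates it has executed match the coordinates the I/O thread has read, using the field names reported by SHOW SLAVE STATUS.

```python
def sql_thread_up_to_date(status: dict) -> bool:
    """True when the SQL thread has applied everything the I/O thread fetched."""
    return (
        status["Relay_Master_Log_File"] == status["Master_Log_File"]
        and status["Exec_Master_Log_Pos"] >= status["Read_Master_Log_Pos"]
    )


def decide_promotion(status: dict, fail_if_not_up_to_date: bool) -> str:
    """Model the effect of FailMasterPromotionIfSQLThreadNotUpToDate (illustrative only)."""
    if sql_thread_up_to_date(status):
        return "promote"
    if fail_if_not_up_to_date:
        return "abort"
    # Flag disabled: promotion goes ahead although relay logs are still unapplied.
    return "promote-while-lagging"


lagging = {
    "Master_Log_File": "mysql-bin.000012",
    "Read_Master_Log_Pos": 941353804,
    "Relay_Master_Log_File": "mysql-bin.000012",
    "Exec_Master_Log_Pos": 318047080,  # hypothetical value, below the read position
}
print(decide_promotion(lagging, fail_if_not_up_to_date=True))
print(decide_promotion(lagging, fail_if_not_up_to_date=False))
```

With the flag enabled the lagging candidate is rejected; with it disabled the same candidate is promoted, which is where the data-loss risk described above comes from.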

Suppose "FailMasterPromotionIfSQLThreadNotUpToDate" is enabled in the Orchestrator configuration file "orchestrator.conf.json":

"FailMasterPromotionIfSQLThreadNotUpToDate": true

This is the topology managed by Orchestrator:

Anils-MacBook-Pro.local:22637   [0s,ok,8.0.36,rw,ROW,>>,GTID]
+ Anils-MacBook-Pro.local:22638 [0s,ok,8.0.36,ro,ROW,>>,GTID]
+ Anils-MacBook-Pro.local:22639 [0s,ok,8.0.36,ro,ROW,>>,GTID]

Below, we run some workload through sysbench, which helps build up replication lag for our test.

sysbench \
  --db-driver=mysql \
  --mysql-user=sbtest_user \
  --mysql-password=Sbtest@2022 \
  --mysql-db=sbtest \
  --mysql-host=127.0.0.1 \
  --mysql-port=22637 \
  --tables=15 \
  --table-size=3000000 \
  --create_secondary=off \
  --threads=100 \
  --time=0 \
  --events=0 \
  --report-interval=1 \
  /opt/homebrew/Cellar/sysbench/1.0.20_7/share/sysbench/oltp_read_write.lua run

Output:

[ 256s ] thds: 100 tps: 1902.77 qps: 38008.31 (r/w/o: 26611.72/7591.05/3805.54) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 257s ] thds: 100 tps: 1960.71 qps: 38960.02 (r/w/o: 27292.45/7746.15/3921.42) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 258s ] thds: 100 tps: 1803.48 qps: 35773.52 (r/w/o: 24991.65/7174.91/3606.96) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00

After a while, replication lag started to build up on the replicas, and at that point we stopped the master [127.0.0.1:22637].

slave1 [localhost:22638] {msandbox} ((none)) > show slave status \G;
*************************** 1. row ***************************
               Slave_IO_State: 
                  Master_Host: 127.0.0.1
                  Master_User: rsandbox
                  Master_Port: 22637
                Connect_Retry: 60
              Master_Log_File: mysql-bin.000012
          Read_Master_Log_Pos: 941353804
               Relay_Log_File: mysql-relay.000034
                Relay_Log_Pos: 318047296
        Relay_Master_Log_File: mysql-bin.000012
             Slave_IO_Running: No
            Slave_SQL_Running: Yes
…

       Seconds_Behind_Master: 215
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 2003
                Last_IO_Error: Error reconnecting to source 'rsandbox@127.0.0.1:22637'. This was attempt 1/86400, with a delay of 60 seconds between attempts. Message: Can't connect to MySQL server on '127.0.0.1:22637' (61)
                

slave2 [localhost:22639] {msandbox} ((none)) > show slave status \G;
*************************** 1. row ***************************
               Slave_IO_State: 
                  Master_Host: 127.0.0.1
                  Master_User: rsandbox
                  Master_Port: 22637
                Connect_Retry: 60
              Master_Log_File: mysql-bin.000012
          Read_Master_Log_Pos: 941353804
               Relay_Log_File: mysql-relay.000002
                Relay_Log_Pos: 302890408
        Relay_Master_Log_File: mysql-bin.000012
             Slave_IO_Running: No
            Slave_SQL_Running: Yes

…
        Seconds_Behind_Master: 215
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 2003
                Last_IO_Error: Error reconnecting to source 'rsandbox@127.0.0.1:22637'. This was attempt 1/86400, with a delay of 60 seconds between attempts. Message: Can't connect to MySQL server on '127.0.0.1:22637' (61)
                

As a result, the master promotion failed because the SQL thread was not up to date.

2025-06-04 23:42:21 ERROR RecoverDeadMaster: failed promotion. FailMasterPromotionIfSQLThreadNotUpToDate is set and promoted replica Anils-MacBook-Pro.local:22638 's sql thread is not up to date (relay logs still unapplied). Aborting promotion

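Now, if "FailMasterPromotionIfSQLThreadNotUpToDate" is false (the default), the failover goes ahead even when the replicas are suffering from replication lag.

In the same scenario as above, with "FailMasterPromotionIfSQLThreadNotUpToDate": false, the master promotion completed successfully.

cat /tmp/recovery.log:
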
20250604 23:51:19:  Detected AllMasterReplicasNotReplicating on Anils-MacBook-Pro.local:22637. Affected replicas: 2
20250604 23:52:41:  Detected DeadMaster on Anils-MacBook-Pro.local:22637. Affected replicas: 2
20250604 23:52:56:  Will recover from DeadMaster on Anils-MacBook-Pro.local:22637
20250604 23:53:07:  Recovered from DeadMaster on Anils-MacBook-Pro.local:22637. Failed: Anils-MacBook-Pro.local:22637; Promoted: Anils-MacBook-Pro.local:22638
20250604 23:53:07:  (for all types) Recovered from DeadMaster on Anils-MacBook-Pro.local:22637. Failed: Anils-MacBook-Pro.local:22637; Successor: Anils-MacBook-Pro.local:22638

DelayMasterPromotionIfSQLThreadNotUpToDate

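This parameter is the counterpart of the one discussed above. Rather than aborting the master failover, it delays the promotion until the candidate master has applied all of its relay logs. When it is set to "true", the orchestrator process waits for the SQL thread to catch up before promoting the new master.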
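The waiting behaviour can be pictured as a simple polling loop. This is a hedged Python sketch rather than Orchestrator's implementation: `wait_for_sql_thread`, the half-second polling interval, and the optional timeout are all assumptions chosen for illustration, and the "is the SQL thread caught up?" check is injected as a callable so the loop can be exercised without a database.

```python
import time
from typing import Callable, Optional


def wait_for_sql_thread(is_up_to_date: Callable[[], bool],
                        poll_interval: float = 0.5,
                        timeout: Optional[float] = None) -> bool:
    """Poll until the SQL thread has caught up; return False if the timeout expires."""
    deadline = None if timeout is None else time.monotonic() + timeout
    while not is_up_to_date():
        if deadline is not None and time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)
    return True


# Simulated replica that reports "caught up" on the third poll.
polls = iter([False, False, True])
caught_up = wait_for_sql_thread(lambda: next(polls), poll_interval=0.01)
print("promote" if caught_up else "still waiting")
```

The design point is that promotion is deferred, not cancelled: the candidate stays the same, and the only question is how long the caller is prepared to wait.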

Suppose "DelayMasterPromotionIfSQLThreadNotUpToDate" is enabled in the Orchestrator configuration file "orchestrator.conf.json":

"DelayMasterPromotionIfSQLThreadNotUpToDate": true

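There was some workload running on the master [127.0.0.1:22637], and within a few seconds replication lag started to appear. We then stopped the master.

We can see in the log file /tmp/recovery.log that the initial failover process has started.
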
20250605 19:10:04:  Detected UnreachableMasterWithLaggingReplicas on Anils-MacBook-Pro.local:22637. Affected replicas: 2
20250605 19:10:06:  Detected DeadMaster on Anils-MacBook-Pro.local:22637. Affected replicas: 2
20250605 19:10:06:  Will recover from DeadMaster on Anils-MacBook-Pro.local:22637
20250605 19:10:16:  Will recover from DeadMaster on Anils-MacBook-Pro.local:22637

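However, we can observe that the promotion is put on hold because of replication lag on the candidate master [Anils-MacBook-Pro.local:22638]; the lag has to clear before the failover proceeds.
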
2025-06-05 19:10:27 ERROR DelayMasterPromotionIfSQLThreadNotUpToDate error: 2025-06-05 19:10:27 ERROR WaitForSQLThreadUpToDate stale coordinates timeout on Anils-MacBook-Pro.local:22638 after duration 10s
...
2025-06-05 19:10:27 DEBUG WaitForSQLThreadUpToDate waiting on Anils-MacBook-Pro.local:22638
2025-06-05 19:10:28 DEBUG WaitForSQLThreadUpToDate waiting on Anils-MacBook-Pro.local:22638
2025-06-05 19:10:28 DEBUG WaitForSQLThreadUpToDate waiting on Anils-MacBook-Pro.local:22638
2025-06-05 19:10:29 DEBUG WaitForSQLThreadUpToDate waiting on Anils-MacBook-Pro.local:22638
2025-06-05 19:10:29 DEBUG WaitForSQLThreadUpToDate waiting on Anils-MacBook-Pro.local:22638
2025-06-05 19:10:30 DEBUG WaitForSQLThreadUpToDate waiting on Anils-MacBook-Pro.local:22638
...
2025-06-05 19:10:37 INFO topology_recovery: DelayMasterPromotionIfSQLThreadNotUpToDate error: 2025-06-05 19:10:37 ERROR WaitForSQLThreadUpToDate stale coordinates timeout on Anils-MacBook-Pro.local:22638 after duration 10s

slave1 [localhost:22638] {msandbox} ((none)) > show slave status \G;
*************************** 1. row ***************************
               Slave_IO_State: Waiting for source to send event
                  Master_Host: 127.0.0.1
                  Master_User: rsandbox
                  Master_Port: 22637
                Connect_Retry: 60
              Master_Log_File: mysql-bin.000017
          Read_Master_Log_Pos: 981203536
               Relay_Log_File: mysql-relay.000002
                Relay_Log_Pos: 179833778
        Relay_Master_Log_File: mysql-bin.000017
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
...
        Seconds_Behind_Master: 227


slave1 [localhost:22638] {msandbox} ((none)) > show slave status \G;
*************************** 1. row ***************************
               Slave_IO_State: 
                  Master_Host: 127.0.0.1
                  Master_User: rsandbox
                  Master_Port: 22637
                Connect_Retry: 60
              Master_Log_File: mysql-bin.000017
          Read_Master_Log_Pos: 1017383839
               Relay_Log_File: mysql-relay.000002
                Relay_Log_Pos: 237223256
        Relay_Master_Log_File: mysql-bin.000017
             Slave_IO_Running: No
            Slave_SQL_Running: No

…
       Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 2003
                Last_IO_Error: Error reconnecting to source 'rsandbox@127.0.0.1:22637'. This was attempt 1/86400, with a delay of 60 seconds between attempts. Message: Can't connect to MySQL server on '127.0.0.1:22637' (61)


slave2 [localhost:22639] {root} ((none)) > show slave status \G;
*************************** 1. row ***************************
               Slave_IO_State: Waiting for source to send event
                  Master_Host: 127.0.0.1
                  Master_User: rsandbox
                  Master_Port: 22637
                Connect_Retry: 60
              Master_Log_File: mysql-bin.000017
          Read_Master_Log_Pos: 907360640
               Relay_Log_File: mysql-relay.000004
                Relay_Log_Pos: 167191740
        Relay_Master_Log_File: mysql-bin.000017
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
...
        Seconds_Behind_Master: 210
        
        
slave2 [localhost:22639] {root} ((none)) > show slave status \G;
*************************** 1. row ***************************
               Slave_IO_State: 
                  Master_Host: 127.0.0.1
                  Master_User: rsandbox
                  Master_Port: 22637
                Connect_Retry: 60
              Master_Log_File: mysql-bin.000017
          Read_Master_Log_Pos: 1017383839
               Relay_Log_File: mysql-relay.000004
                Relay_Log_Pos: 237135927
        Relay_Master_Log_File: mysql-bin.000017
             Slave_IO_Running: No
            Slave_SQL_Running: No

…
       Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 2003
                Last_IO_Error: Error reconnecting to source 'rsandbox@127.0.0.1:22637'. This was attempt 1/86400, with a delay of 60 seconds between attempts. Message: Can't connect to MySQL server on '127.0.0.1:22637' (61)

There is one more observation here. If we attempt a graceful takeover while all replicas are lagging, we get the following message.

shell> orchestrator-client -c graceful-master-takeover -alias testcluster -d Anils-MacBook-Pro.local:22638 

Output:

Desginated instance Anils-MacBook-Pro.local:22638 seems to be lagging too much for this operation. Aborting.

This happens because of the condition below, which requires the replication lag to be less than or equal to the configured "ReasonableMaintenanceReplicationLagSeconds": 20.

if !designatedInstance.HasReasonableMaintenanceReplicationLag() {
        return nil, nil, fmt.Errorf("Desginated instance %+v seems to be lagging too much for this operation. Aborting.", designatedInstance.Key)
    }
    
func (this *Instance) HasReasonableMaintenanceReplicationLag() bool {
    // replicas with SQLDelay are a special case
    if this.SQLDelay > 0 {
        return math.AbsInt64(this.SecondsBehindMaster.Int64-int64(this.SQLDelay)) <= int64(config.Config.ReasonableMaintenanceReplicationLagSeconds)
    }
    return this.SecondsBehindMaster.Int64 <= int64(config.Config.ReasonableMaintenanceReplicationLagSeconds)
}

https://github.com/openark/orchestrator/blob/730db91f70344e38296dbb0fecdbc0cefd6fca79/go/logic/topology_recovery.go#L2124

https://github.com/openark/orchestrator/blob/730db91f70344e38296dbb0fecdbc0cefd6fca79/go/inst/instance.go#L37
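To make the condition easier to experiment with, here is a small Python port of the Go excerpt above. It is only a sketch for illustration: the function name and parameters are my own, the threshold of 20 seconds is the value quoted in the text, and the `sql_delay` branch mirrors the special case for delayed replicas in the excerpt.

```python
REASONABLE_MAINTENANCE_REPLICATION_LAG_SECONDS = 20  # value quoted in the text


def has_reasonable_maintenance_replication_lag(
        seconds_behind_master: int,
        sql_delay: int = 0,
        threshold: int = REASONABLE_MAINTENANCE_REPLICATION_LAG_SECONDS) -> bool:
    """Python rendering of the lag check shown in the Go excerpt."""
    if sql_delay > 0:
        # Delayed replicas: compare the lag against the intended delay.
        return abs(seconds_behind_master - sql_delay) <= threshold
    return seconds_behind_master <= threshold


print(has_reasonable_maintenance_replication_lag(215))   # lag seen earlier: rejected
print(has_reasonable_maintenance_replication_lag(1))     # accepted
print(has_reasonable_maintenance_replication_lag(3610, sql_delay=3600))  # accepted
```

A replica that is 215 seconds behind fails the check, which is why the graceful takeover is refused before any promotion logic runs.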

Here, the Orchestrator service log shows that the failover process is now waiting for all relay logs to be applied.

2025-06-05 20:15:01 INFO CommandRun successful. exit status 0
2025-06-05 20:15:01 INFO topology_recovery: Completed PreGracefulTakeoverProcesses hook 1 of 1 in 5.463653s
2025-06-05 20:15:01 INFO topology_recovery: done running PreGracefulTakeoverProcesses hooks
2025-06-05 20:15:01 INFO GracefulMasterTakeover: Will set Anils-MacBook-Pro.local:22637 as read_only
2025-06-05 20:15:01 INFO instance Anils-MacBook-Pro.local:22637 read_only: true
2025-06-05 20:15:01 INFO auditType:read-only instance:Anils-MacBook-Pro.local:22637 cluster:Anils-MacBook-Pro.local:22637 message:set as true
2025-06-05 20:15:01 INFO GracefulMasterTakeover: Will wait for Anils-MacBook-Pro.local:22638 to reach master coordinates mysql-bin.000021:221642748

2025-06-05 20:19:53 INFO topology_recovery: DelayMasterPromotionIfSQLThreadNotUpToDate: waiting for SQL thread on Anils-MacBook-Pro.local:22638
2025-06-05 20:19:53 DEBUG WaitForSQLThreadUpToDate waiting on Anils-MacBook-Pro.local:22638
2025-06-05 20:19:53 DEBUG WaitForSQLThreadUpToDate waiting on Anils-MacBook-Pro.local:22638
2025-06-05 20:19:54 DEBUG WaitForSQLThreadUpToDate waiting on Anils-MacBook-Pro.local:22638
2025-06-05 20:19:54 DEBUG WaitForSQLThreadUpToDate waiting on Anils-MacBook-Pro.local:22638
2025-06-05 20:19:55 DEBUG WaitForSQLThreadUpToDate waiting on Anils-MacBook-Pro.local:22638
2025-06-05 20:19:55 DEBUG WaitForSQLThreadUpToDate waiting on Anils-MacBook-Pr

..

2025-06-05 20:26:37 INFO topology_recovery: DelayMasterPromotionIfSQLThreadNotUpToDate: SQL thread caught up on Anils-MacBook-Pro.local:22638
2025-06-05 20:26:37 INFO topology_recovery: RecoverDeadMaster: found no reason to override promotion of Anils-MacBook-Pro.local:22638
2025-06-05 20:26:37 INFO topology_recovery: RecoverDeadMaster: successfully promoted Anils-MacBook-Pro.local:22638
2025-06-05 20:26:37 INFO topology_recovery: - RecoverDeadMaster: promoted server coordinates: mysql-bin.000017:221167478
2025-06-05 20:26:37 INFO topology_recovery: - RecoverDeadMaster: will apply MySQL changes to promoted master
2025-06-05 20:26:37 INFO Will reset replica on Anils-MacBook-Pro.local:22638

Once the lag is resolved, the takeover process completes successfully.

20250605 20:19:53:  Will recover from DeadMaster on Anils-MacBook-Pro.local:22637
20250605 20:26:39:  Recovered from DeadMaster on Anils-MacBook-Pro.local:22637. Failed: Anils-MacBook-Pro.local:22637; Promoted: Anils-MacBook-Pro.local:22638
20250605 20:26:39:  (for all types) Recovered from DeadMaster on Anils-MacBook-Pro.local:22637. Failed: Anils-MacBook-Pro.local:22637; Successor: Anils-MacBook-Pro.local:22638
20250605 20:26:39:  Planned takeover complete
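The cost of waiting can be read straight from this log. The sketch below is an illustrative Python helper, not part of Orchestrator: it assumes the "YYYYMMDD HH:MM:SS: message" layout seen in this setup's /tmp/recovery.log, and the function names are hypothetical. It measures the time between the "Will recover" and "Recovered" entries.

```python
import re
from datetime import datetime

LINE = re.compile(r"^(\d{8} \d{2}:\d{2}:\d{2}):\s+(.*)$")


def parse_recovery_log(text: str) -> list:
    """Return (timestamp, message) tuples for every line that matches the layout."""
    events = []
    for line in text.splitlines():
        match = LINE.match(line)
        if match:
            stamp = datetime.strptime(match.group(1), "%Y%m%d %H:%M:%S")
            events.append((stamp, match.group(2)))
    return events


def recovery_duration_seconds(events: list) -> float:
    """Seconds between the first 'Will recover' and the first 'Recovered' entry."""
    start = next(ts for ts, msg in events if msg.startswith("Will recover"))
    end = next(ts for ts, msg in events if msg.startswith("Recovered"))
    return (end - start).total_seconds()


sample = """\
20250605 20:19:53:  Will recover from DeadMaster on Anils-MacBook-Pro.local:22637
20250605 20:26:39:  Recovered from DeadMaster on Anils-MacBook-Pro.local:22637. Failed: Anils-MacBook-Pro.local:22637; Promoted: Anils-MacBook-Pro.local:22638
"""
print(recovery_duration_seconds(parse_recovery_log(sample)))  # 406.0
```

In this run the promotion was held back for 406 seconds, a little under seven minutes, while the candidate applied its relay logs.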

shell> orchestrator-client -c topology -a testcluster    

Output:

Anils-MacBook-Pro.local:22638   [0s,ok,8.0.36,rw,ROW,>>,GTID]
- Anils-MacBook-Pro.local:22637 [null,nonreplicating,8.0.36,ro,ROW,>>,GTID]
+ Anils-MacBook-Pro.local:22639 [0s,ok,8.0.36,ro,ROW,>>,GTID]

FailMasterPromotionOnLagMinutes

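This parameter ensures that the master promotion is aborted when the replica lag is >= the configured number of minutes. To use this flag, we must also use "ReplicationLagQuery" together with a heartbeat mechanism, "pt-heartbeat", to measure the replication lag correctly.

Let's see how it works.

We set the following value in the Orchestrator configuration "orchestrator.conf.json" so that the master promotion fails once the lag reaches about 1 minute.
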
"FailMasterPromotionOnLagMinutes": 1,

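As discussed above, enabling this option depends on setting "ReplicationLagQuery", which gets the replication lag from the heartbeat mechanism instead of relying on the Seconds_Behind_Master status. Without it, Orchestrator refuses to start:
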
2025-06-06 08:54:47 INFO starting orchestrator, version: 3.2.6, git commit: 89f3bdd33931d5e234890787a24cc035fa106b32
2025-06-06 08:54:47 INFO Read config: /Users/aniljoshi/orchestrator/conf/orchestrator.conf.json
2025-06-06 08:54:47 FATAL nonzero FailMasterPromotionOnLagMinutes requires ReplicationLagQuery to be set
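The same dependency can be checked before restarting the service. This is a minimal pre-flight sketch in Python under stated assumptions: `validate_config` is a hypothetical helper, not an Orchestrator API, and it only reproduces the one rule reported in the FATAL message above.

```python
import json


def validate_config(config: dict) -> None:
    """Raise if FailMasterPromotionOnLagMinutes is set without ReplicationLagQuery."""
    lag_minutes = config.get("FailMasterPromotionOnLagMinutes", 0)
    if lag_minutes > 0 and not config.get("ReplicationLagQuery"):
        raise ValueError(
            "nonzero FailMasterPromotionOnLagMinutes requires "
            "ReplicationLagQuery to be set"
        )


# Config fragment with the flag set but no lag query defined.
config = json.loads('{"FailMasterPromotionOnLagMinutes": 1}')
try:
    validate_config(config)
except ValueError as err:
    print(f"invalid config: {err}")
```

In practice the dictionary would come from loading the real orchestrator.conf.json instead of the inline fragment used here.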

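By default, Orchestrator uses the replica status "Seconds_Behind_Master" to monitor replication lag. However, when replication is already broken and the master has also failed, "Seconds_Behind_Master" is NULL, which leaves Orchestrator without the accurate details it needs for the decision.

So we use pt-heartbeat as the source of replication lag. pt-heartbeat is a replication delay monitoring system that measures delay by looking at actual replicated data. It provides the "absolute" delay from the master as well as sub-second resolution.

Below is the "ReplicationLagQuery" configuration that we define in the Orchestrator configuration file.
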
"ReplicationLagQuery": "SELECT CAST((UNIX_TIMESTAMP(NOW()) - UNIX_TIMESTAMP(ts)) AS unsigned INTEGER) AS 'delay' FROM percona.heartbeat ORDER BY ts DESC LIMIT 1",

We also need a separate pt-heartbeat process running on each source/replica instance.

Anils-MacBook-Pro.local:22637:
shell> pt-heartbeat --check-read-only --read-only-interval=1 --fail-successive-errors 5 --interval=0.1 --create-table --create-table-engine=InnoDB --database=percona --table=heartbeat --host=127.0.0.1 --user=heartbeat --password=Heartbeat@1234 --port=22637 --update &
Anils-MacBook-Pro.local:22638:
shell> pt-heartbeat --check-read-only --read-only-interval=1 --fail-successive-errors 5 --interval=0.1 --create-table --create-table-engine=InnoDB --database=percona --table=heartbeat --host=127.0.0.1 --user=heartbeat --password=Heartbeat@1234 --port=22638 --update &
Anils-MacBook-Pro.local:22639:
shell> pt-heartbeat --check-read-only --read-only-interval=1 --fail-successive-errors 5 --interval=0.1 --create-table --create-table-engine=InnoDB --database=percona --table=heartbeat --host=127.0.0.1 --user=heartbeat --password=Heartbeat@1234 --port=22639 --update &
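  • --read-only-interval => When --check-read-only is specified, the interval to sleep for while the server is found to be read-only. If not specified, --interval is used.
  • --fail-successive-errors => If specified, pt-heartbeat fails after the given number of successive DBI errors (failure to connect to the server or to issue a query).
  • --interval => How often to update or check the heartbeat table. Updates and checks begin on the first whole second and then repeat every --interval seconds for --update and every --interval plus --skew seconds for --monitor.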

Reference: https://docs.percona.com/percona-toolkit/pt-heartbeat.html
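Since the three commands differ only in the port, they can be generated rather than typed by hand. The following Python sketch is an illustration under assumptions: `pt_heartbeat_command` is a hypothetical helper, the flags are exactly the ones used in the commands above, and the password is a placeholder. It prints the commands instead of executing them.

```python
import shlex


def pt_heartbeat_command(port: int, user: str, password: str) -> list:
    """Build the pt-heartbeat --update command line used in this setup for one instance."""
    return [
        "pt-heartbeat",
        "--check-read-only",
        "--read-only-interval=1",
        "--fail-successive-errors", "5",
        "--interval=0.1",
        "--create-table",
        "--create-table-engine=InnoDB",
        "--database=percona",
        "--table=heartbeat",
        "--host=127.0.0.1",
        f"--user={user}",
        f"--password={password}",
        f"--port={port}",
        "--update",
    ]


for port in (22637, 22638, 22639):
    # Placeholder credentials; the command is printed, not run.
    print(shlex.join(pt_heartbeat_command(port, "heartbeat", "<password>")))
```

Because every instance runs with --check-read-only, the process on a read-only replica sleeps instead of writing, so only the writable master actually updates the heartbeat table at any given time.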

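The delay is calculated on the replica node as the difference between the current system time and the replicated timestamp value in the heartbeat table. Basically, on the master node, pt-heartbeat updates the heartbeat table with the server ID and the current timestamp at every --interval (every second by default; 0.1 seconds in the commands above). These updates reach the replica nodes through asynchronous replication.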

For example:

slave1 [localhost:22638] {msandbox} (percona) > select * from percona.heartbeat;
+----------------------------+-----------+------------------+----------+-----------------------+---------------------+
| ts                         | server_id | file             | position | relay_source_log_file | exec_source_log_pos |
+----------------------------+-----------+------------------+----------+-----------------------+---------------------+
| 2025-06-06T10:34:56.410320 |       100 | mysql-bin.000023 |  1654366 | NULL                  |                NULL |
+----------------------------+-----------+------------------+----------+-----------------------+---------------------+
2 rows in set (0.00 sec)

slave2 [localhost:22639] {root} (percona) > SELECT CAST((UNIX_TIMESTAMP(NOW()) - UNIX_TIMESTAMP(ts)) AS signed INTEGER) AS 'delay' FROM percona.heartbeat ORDER BY ts
DESC LIMIT 1;
+-------+
| delay |
+-------+
|    53 |
+-------+
1 row in set (0.01 sec)
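The arithmetic behind the lag query, and the way the result feeds the promotion decision, can be sketched in Python. This is an illustration under assumptions, not Orchestrator's code: both function names are hypothetical, the "current time" is a made-up value chosen to land on the 53-second delay shown above, and the rule "abort when lag >= configured minutes, 0 disables the check" follows the description of FailMasterPromotionOnLagMinutes in this article.

```python
from datetime import datetime


def heartbeat_lag_seconds(heartbeat_ts: str, now: datetime) -> int:
    """Lag = replica's current time minus the replicated heartbeat timestamp."""
    ts = datetime.fromisoformat(heartbeat_ts)
    return max(0, int((now - ts).total_seconds()))


def promotion_allowed(lag_seconds: int, fail_on_lag_minutes: int) -> bool:
    """Model FailMasterPromotionOnLagMinutes; a value of 0 disables the check."""
    if fail_on_lag_minutes <= 0:
        return True
    return lag_seconds < fail_on_lag_minutes * 60


# Heartbeat row from the example above, with a hypothetical current time.
now = datetime(2025, 6, 6, 10, 35, 49, 500000)
lag = heartbeat_lag_seconds("2025-06-06T10:34:56.410320", now)
print(lag)                                                 # 53
print(promotion_allowed(lag, fail_on_lag_minutes=1))       # True
print(promotion_allowed(696, fail_on_lag_minutes=1))       # False
```

Unlike Seconds_Behind_Master, this value stays meaningful after the master is gone, because the last replicated heartbeat row is still present on the replica.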

Let's look at the behavior with "FailMasterPromotionOnLagMinutes" enabled through a quick scenario.

We run some workload in the background, which causes replication lag.

slave1 [localhost:22638] {root} ((none)) > show slave status \G;
*************************** 1. row ***************************
               Slave_IO_State: Waiting for source to send event
                  Master_Host: 127.0.0.1
                  Master_User: rsandbox
                  Master_Port: 22637
                Connect_Retry: 60
              Master_Log_File: mysql-bin.000005
          Read_Master_Log_Pos: 355883980
               Relay_Log_File: mysql-relay.000003
                Relay_Log_Pos: 452576487
        Relay_Master_Log_File: mysql-bin.000003
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB: 
          Replicate_Ignore_DB: 
           Replicate_Do_Table: 
       Replicate_Ignore_Table: 
      Replicate_Wild_Do_Table: 
  Replicate_Wild_Ignore_Table: 
                   Last_Errno: 0
                   Last_Error: 
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 453709829
              Relay_Log_Space: 2503386167
              Until_Condition: None
               Until_Log_File: 
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File: 
           Master_SSL_CA_Path: 
              Master_SSL_Cert: 
            Master_SSL_Cipher: 
               Master_SSL_Key: 
        Seconds_Behind_Master: 696

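We then attempt a graceful master takeover, but it fails because of the replication lag (with the same message as in the earlier graceful-takeover attempt).
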
shell> orchestrator-client -c graceful-master-takeover -alias testcluster -d Anils-MacBook-Pro.local:22638 
Desginated instance Anils-MacBook-Pro.local:22638 seems to be lagging too much for this operation. Aborting.

However, once the replication lag drops below 1 minute (the threshold we specified for FailMasterPromotionOnLagMinutes), the failover process runs smoothly.

slave1 [localhost:22638] {root} ((none)) > show slave status \G;
*************************** 1. row ***************************
               Slave_IO_State: Waiting for source to send event
                  Master_Host: 127.0.0.1
                  Master_User: rsandbox
                  Master_Port: 22637
                Connect_Retry: 60
              Master_Log_File: mysql-bin.000005
          Read_Master_Log_Pos: 605457002
               Relay_Log_File: mysql-relay.000008
                Relay_Log_Pos: 605455949
        Relay_Master_Log_File: mysql-bin.000005
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB: 
          Replicate_Ignore_DB: 
           Replicate_Do_Table: 
       Replicate_Ignore_Table: 
      Replicate_Wild_Do_Table: 
  Replicate_Wild_Ignore_Table: 
                   Last_Errno: 0
                   Last_Error: 
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 605455733
              Relay_Log_Space: 605457511
              Until_Condition: None
               Until_Log_File: 
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File: 
           Master_SSL_CA_Path: 
              Master_SSL_Cert: 
            Master_SSL_Cipher: 
               Master_SSL_Key: 
        Seconds_Behind_Master: 1

As a result, the master fails over to "Anils-MacBook-Pro.local:22638".

shell> orchestrator-client -c graceful-master-takeover -alias testcluster -d Anils-MacBook-Pro.local:22638 
Anils-MacBook-Pro.local:22638

The failover log "/tmp/recovery.log":

20250607 22:19:53:  Recovered from DeadMaster on Anils-MacBook-Pro.local:22637. Failed: Anils-MacBook-Pro.local:22637; Promoted: Anils-MacBook-Pro.local:22638
20250607 22:19:54:  (for all types) Recovered from DeadMaster on Anils-MacBook-Pro.local:22637. Failed: Anils-MacBook-Pro.local:22637; Successor: Anils-MacBook-Pro.local:22638
20250607 22:19:54:  Planned takeover complete

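Conclusion

The options discussed above give fine-grained control over the MySQL Orchestrator failover process, especially when replicas are suffering from replication lag. Essentially, we can choose to wait for the lag to clear before the failover is triggered, or to fail over immediately despite the lag. In addition, settings such as FailMasterPromotionOnLagMinutes and FailMasterPromotionIfSQLThreadNotUpToDate make the failover fail when there is lag, which favors data consistency.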

This article is a translation of a foreign-language original.
