作者:蒋士峰,爱可生 DBA 团队成员,熟悉 MySQL,Oracle 等数据库。每天的积累,时间久了,会带来不一样的收获。
爱可生开源社区出品,原创内容未经授权不得随意使用,转载请联系小编并注明来源。
本文约 2500 字,预计阅读需要 8 分钟。
—— 贺春旸 凡普金科和爱钱进 DBA 团队负责人 2025.3.10
在日常的工作中,我们经常遇到这样的问题:
这些问题,说起来大家都知道怎么回事,也知道怎么处理。但是出现一次两次还好说,如果频繁出现,不仅处理起来繁琐困难,也说明了主从基本上处于废置状态。生产环境中,通常数据量比较大,又有业务在进行读写,重置一次主从关系的成本比较高。
那么,针对频繁出现的 1032 和 1062 错误,今天就为大家介绍一款工具 pt-slave-repair[1] ,让大家花最小的成本来处理这个常见的主从故障问题。
pt-slave-restart[2] 是 percona-toolkit 工具集[3] 的一款 MySQL 主从异常处理工具。
pt-slave-repair 是对原有 pt-slave-restart 工具的补充,作者贺春旸。它提供自动修复 MySQL 主从同步复制的报错数据,以及恢复中断的 SQL_THEARD 复制线程,了解更多[4]。
在具体实操之前,先引用一段作者对 pt-slave-repair 工具的工作原理介绍。
show slave status
信息,得到 binlog、position、gtid 信息。slave_parallel_workers
多线程并行复制。SET gtid_next
方式;如果开启位置点复制模式,启动 SET GLOBAL SQL_SLAVE_SKIP_COUNTER=1
方式。show slave status
同步为双 YES(同步正常)。先对复制用户 repl 添加相应权限。
3306[(none)]>show grants for repl@'%';
+----------------------------------------------+
| Grants for repl@% |
+----------------------------------------------+
| GRANT REPLICATION SLAVE ON *.* TO 'repl'@'%' |
+----------------------------------------------+
1 row in set (0.00 sec)
3306[(none)]>GRANT SUPER, REPLICATION SLAVE, REPLICATION CLIENT ON *.* TO 'repl'@'%';
Query OK, 0 rows affected (0.00 sec)
3306[(none)]>GRANT SELECT, INSERT, UPDATE, DELETE ON `test`.* TO 'repl'@'%' ;
Query OK, 0 rows affected (0.00 sec)
3306[(none)]>GRANT SELECT ON `performance_schema`.* TO 'repl'@'%';
Query OK, 0 rows affected (0.01 sec)
3306[(none)]>show grants for repl@'%';
+-------------------------------------------------------------------------+
| Grants for repl@% |
+-------------------------------------------------------------------------+
| GRANT SUPER, REPLICATION SLAVE, REPLICATION CLIENT ON *.* TO 'repl'@'%' |
| GRANT SELECT, INSERT, UPDATE, DELETE ON `test`.* TO 'repl'@'%' |
| GRANT SELECT ON `performance_schema`.* TO 'repl'@'%' |
+-------------------------------------------------------------------------+
3 rows in set (0.00 sec)
3306[(none)]>flush privileges;
Query OK, 0 rows affected (0.00 sec)
然后安装 pt-slave-repair 工具。
wget https://github.com/hcymysql/pt-slave-repair/releases/download/pt-slave-repair_v1.0.8/pt-slave-repair
chmod 755 pt-slave-repair
至此,pt-slave-repair工具使用前的准备工作完成。下面我们来讲解工具的使用。
# 主库创建一张 t1 表:
CREATE TABLE `t1` (
`id` int(11) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
# 主库插入两条数据到 t1 表,并且不写入 binlog 日志。
# 这时就会出现,主库有这两条数据,从库不存在着两条数据的情况。
set @@sql_log_bin=0;
insert into t1 values(2),(3);
set @@sql_log_bin=1;
# 主库删除刚刚插入的两条数据 2 3
delete from t1 where id=2;
delete from t1 where id=3;
# 此时,主从 1032 错误出现
3306[test]>show slave status \G
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: 192.168.3.14
Master_User: repl
Master_Port: 3306
...
Slave_IO_Running: Yes
Slave_SQL_Running: No
...
Last_Errno: 1032
Last_Error: Could not execute Delete_rows event on table test.t1; Can't find record in 't1', Error_code: 1032; handler error HA_ERR_END_OF_FILE; the event's master log mysql-bin.000001, end_log_pos 92274
使用 pt-slave-repair 工具修复 1032 错误(一定要连接到从库上运行 !!!)。
/root/pt-slave-repair -H 192.168.3.14 -P 3306 -u repl -p 123456 -d test
# 192.168.3.14:3306 这是一台主库,环境不匹配!
# 前台运行
/root/pt-slave-repair -H 192.168.3.13 -P 3306 -u repl -p 123456 -d test
2025-02-21 04:46:36,015 - ERROR - 主从复制报错. Slave_IO_Running状态值是:Yes | Slave_SQL_Running状态值是:No
Last_Error错误信息是:Could not execute Delete_rows event on table test.t1; Can't find record in 't1', Error_code: 1032; handler error HA_ERR_END_OF_FILE; the event's master log mysql-bin.000001, end_log_pos 92274
Last_SQL_Error 错误信息是:Could not execute Delete_rows event on table test.t1; Can't find record in 't1', Error_code: 1032; handler error HA_ERR_END_OF_FILE; the event's master log mysql-bin.000001, end_log_pos 92274
2025-02-21 04:46:36,024 - ERROR - 错误号是:1032
2025-02-21 04:46:36,024 - ERROR - 错误信息是:Could not execute Delete_rows event on table test.t1; Can't find record in 't1', Error_code: 1032; handler error HA_ERR_END_OF_FILE; the event's master log mysql-bin.000001, end_log_pos 92274
2025-02-21 04:46:36,024 - ERROR - 报错时间是:2025-02-21 04:14:26
2025-02-21 04:46:36,024 - INFO - ----------------------------------------------------------------------------------------------------
2025-02-21 04:46:36,025 - INFO - 开始自动修复同步错误的数据......
2025-02-21 04:46:36,064 - INFO - 修复数据的SQL语句: delete
2025-02-21 04:46:36,393 - INFO - 成功修复了 【2】 行数据
2025-02-21 04:46:37,448 - ERROR - 主从复制报错. Slave_IO_Running状态值是:Yes | Slave_SQL_Running状态值是:No
Last_Error错误信息是:Could not execute Delete_rows event on table test.t1; Can't find record in 't1', Error_code: 1032; handler error HA_ERR_END_OF_FILE; the event's master log mysql-bin.000001, end_log_pos 92527
Last_SQL_Error错误信息是:Could not execute Delete_rows event on table test.t1; Can't find record in 't1', Error_code: 1032; handler error HA_ERR_END_OF_FILE; the event's master log mysql-bin.000001, end_log_pos 92527
2025-02-21 04:46:37,450 - ERROR - 错误号是:1032
2025-02-21 04:46:37,450 - ERROR - 错误信息是:Could not execute Delete_rows event on table test.t1; Can't find record in 't1', Error_code: 1032; handler error HA_ERR_END_OF_FILE; the event's master log mysql-bin.000001, end_log_pos 92527
2025-02-21 04:46:37,450 - ERROR - 报错时间是:2025-02-21 04:46:35
2025-02-21 04:46:37,450 - INFO - ----------------------------------------------------------------------------------------------------
2025-02-21 04:46:37,450 - INFO - 开始自动修复同步错误的数据......
2025-02-21 04:46:37,488 - INFO - 修复数据的SQL语句: delete
2025-02-21 04:46:37,824 - INFO - 成功修复了 【2】 行数据
2025-02-21 04:46:38,887 - WARNING - 开启基于GTID全局事务ID复制,CHANGE MASTER TO MASTER_AUTO_POSITION = 1 需要设置为1.
2025-02-21 04:46:38,887 - INFO - 同步正常.
#再次检查主从状态
3306[test]>show slave status \G
...
Master_Host: 192.168.3.14
Master_User: repl
Master_Port: 3306
...
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
后台运行该工具。
nohup /root/pt-slave-repair -H 192.168.3.13 -P 3306 -u repl -p 123456 -d test > /dev/null &
1062 错误为主从同步时主键冲突提示,表示尝试插入或更新数据时违反了唯一约束。
主要出现在以下两种场景:
# 主库:
CREATE TABLE t2 (
id INT PRIMARY KEY, -- 主键,唯一约束
name VARCHAR(100)
);
INSERT INTO t2 (id, name) VALUES (1, 'Alice');
INSERT INTO t2 (id, name) VALUES (2, 'Bob');
# 从库插入一条数据,并且不写入binlog:
set @@sql_log_bin=0;
INSERT INTO t2 (id, name) VALUES (3, 'jiang');
set @@sql_log_bin=1;
# 主库再次插入一条主键为3的数据:
INSERT INTO t2 (id, name) VALUES (3, 'shifeng');
# 主从报 1062 错误:
3306[test]>show slave status \G
...
Master_Host: 192.168.3.14
Master_User: repl
Master_Port: 3306
...
Last_Errno: 1062
Last_Error: Could not execute Write_rows event on table test.t2; Duplicate entry '3' for key 'PRIMARY', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log mysql-bin.000001, end_log_pos 93519
Skip_Counter: 0
Exec_Master_Log_Pos: 93286
使用 pt-slave-repair 工具修复1062错误(一定要连接到从库上运行 !!!)。
[root@node4 ~]# /root/pt-slave-repair -H 192.168.3.13 -P 3306 -u repl -p 123456 -d test
2025-02-21 05:23:17,145 - ERROR - 主从复制报错. Slave_IO_Running状态值是:Yes | Slave_SQL_Running状态值是:No
Last_Error错误信息是:Could not execute Write_rows event on table test.t2; Duplicate entry '3' for key 'PRIMARY', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log mysql-bin.000001, end_log_pos 93519
Last_SQL_Error错误信息是:Could not execute Write_rows event on table test.t2; Duplicate entry '3' for key 'PRIMARY', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log mysql-bin.000001, end_log_pos 93519
2025-02-21 05:23:17,147 - ERROR - 错误号是:1062
2025-02-21 05:23:17,147 - ERROR - 错误信息是:Could not execute Write_rows event on table test.t2; Duplicate entry '3' for key 'PRIMARY', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log mysql-bin.000001, end_log_pos 93519
2025-02-21 05:23:17,147 - ERROR - 报错时间是:2025-02-21 05:20:02
2025-02-21 05:23:17,147 - INFO - ----------------------------------------------------------------------------------------------------
2025-02-21 05:23:17,147 - INFO - 开始自动修复同步错误的数据......
2025-02-21 05:23:17,201 - INFO - 修复数据的SQL语句: REPLACE INTO `test`.`t2`(`id`,`name`) VALUES (3'shifeng');
2025-02-21 05:23:17,553 - INFO - 成功修复了 【1】 行数据
2025-02-21 05:23:18,619 - WARNING - 开启基于GTID全局事务ID复制,CHANGE MASTER TO MASTER_AUTO_POSITION = 1 需要设置为1.
2025-02-21 05:23:18,620 - INFO - 同步正常.
# 再次检查主从状态
3306[test]>show slave status \G
...
Master_Host: 192.168.3.14
Master_User: repl
Master_Port: 3306
...
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
更多细节,请参考工作作者贺春旸老师的 演示视频[5]。
[1]
pt-slave-repair: https://github.com/hcymysql/pt-slave-repair
[2]
pt-slave-restart: https://docs.percona.com/percona-toolkit/pt-slave-restart.html
[3]
percona-toolkit: https://docs.percona.com/percona-toolkit/
[4]
dba+开源工具:自动修复MySQL主从同步复制的报错数据: https://dbaplus.cn/news-11-5524-1.html
[5]
演示视频: https://edu.51cto.com/video/1658.html