作者:张政俊 就职于中欧基金,知数堂粉丝,数据库爱好者,熟悉RDBMS、nosql、new sql等各类数据库。
1.开发要求truncate一张表,在删除之前需要备份下数据;
2.在从库上使用mysqldump命令做逻辑备份,备份sql:
mysqldump -uXXX -p osdc osdc_XXX > /tmp/osdc_info.sql;
3.这台从库与主库的同步出现中断,报错为:
Slave_IO_Running: Yes
Slave_SQL_Running: No
Last_Errno: 1756
Last_Error: ... The slave coordinator and worker threads are stopped, possibly leaving data in inconsistent state. A restart should restore consistency automatically, although using non-transactional storage for data or info tables or DDL queries could lead to problems. In such cases you have to examine your data (see documentation for details).
ERROR NO 是1756,而且只是 Slave_SQL_Running 停了。
stop slave;
start slave;
重新启停同步服务后,恢复正常。
之前形形色色的主从问题遇到过很多,但是 1756 这个错误还是第一次遇到,所以事后还是很有兴趣去深入分析下这个问题。
查看数据库的 error log ,这个报错的具体输出如下:
2020-08-18 22:04:19 208840 [Warning] Slave SQL: Worker 2 failed executing transaction 'fe0ed3ca-cacc-11e9-8d49-6c92bfcf73b6:1377629720' at master log mysql-bin.003536, end_log_pos 861741231
; Could not execute Update_rows event on table osdc.osdc_XXX; Lock wait timeout exceeded; try restarting transaction, Error_code: 1205; handler error HA_ERR_LOCK_WAIT_TIMEOUT; the event's
master log mysql-bin.003536, end_log_pos 861741231, Error_code: 1205
2020-08-18 22:04:19 208840 [Note] Worker 2 is exiting: killed 0, error 1, running_status 1
2020-08-18 22:04:19 208840 [ERROR] Slave SQL: ... The slave coordinator and worker threads are stopped, possibly leaving data in inconsistent state. A restart should restore consistency aut
omatically, although using non-transactional storage for data or info tables or DDL queries could lead to problems. In such cases you have to examine your data (see documentation for detail
s). Error_code: 1756
错误日志里有个 Lock wait timeout exceeded ,这个我们先记下,稍后再说。
先找到 master 事务的 gtid,直接去主库的 binlog 里找下这个 gtid 所对应的 sql:
SET @@SESSION.GTID_NEXT= 'fe0ed3ca-cacc-11e9-8d49-6c92bfcf73b6:1377629720'/*!*/;
# at 861740635
#200818 21:58:42 server id 31 end_log_pos 861740715 CRC32 0xa9ab0dd2 Query thread_id=9403182 exec_time=0 error_code=0
SET TIMESTAMP=1597759122/*!*/;
BEGIN
/*!*/;
# at 861740715
#200818 21:58:42 server id 31 end_log_pos 861740840 CRC32 0xbd8ba613 Rows_query
# update osdc_XXX
# set modify_time=now(),status='04'
# where id=9298XXXXX and status='01'
可以看到,这时候主库对 osdc_XXX 表进行了 update 操作。
在 rpl_slave.cc 里可以看到:
----
/* MTS technical limitation no support of trans retry */
if (mi->rli->opt_slave_parallel_workers != 0 && slave_trans_retries != 0
----
这个 bug 很早就由 Yoshinori Matsunobu 提在 https://bugs.mysql.com/bug.php?id=68465 了,官方也采纳了,并于 MySQL 5.7.5 修复掉了该问题。
Yoshinori Matsunobu 就是大名鼎鼎的 MHA 的作者!
[root@localhost][(none)]> show variables like '%slave_parallel_workers%';
+------------------------+-------+
| Variable_name | Value |
+------------------------+-------+
| slave_parallel_workers | 5 |
+------------------------+-------+
[root@localhost][(none)]> show variables like '%innodb_lock_wait_timeout%';
+--------------------------+-------+
| Variable_name | Value |
+--------------------------+-------+
| innodb_lock_wait_timeout | 50 |
+--------------------------+-------+
[root@localhost][none]> show variables like '%slave_transaction_retries%';
+---------------------------+-------+
| Variable_name | Value |
+---------------------------+-------+
| slave_transaction_retries | 10 |
+---------------------------+-------+
[root@localhost][bug]> select * from repl_test;
+------+-------+
| id | info |
+------+-------+
| 1 | tom |
| 2 | jerry |
+------+-------+
从库:
[root@localhost][bug]> begin;update repl_test set info='all';
Query OK, 0 rows affected (0.00 sec)
Query OK, 2 rows affected (0.01 sec)
Rows matched: 2 Changed: 2 Warnings: 0
显性开启事务,让它持续持有 InnoDB lock
主库:
[root@localhost][bug]> update repl_test set info='all' where id=1;commit;
Query OK, 1 row affected (0.00 sec)
Rows matched: 1 Changed: 1 Warnings: 0
Query OK, 0 rows affected (0.00 sec)
手工更新 repl_test 表的一条数据,此时从库的复制线程会等待这条sql的应用,并处于 system lock 状态:
从库:
[root@localhost][(none)]> show processlist;
+----+-----------------+-----------+------+---------+------+-----------------------------------------------------------------------------+--------------------------------------------+
| Id | User | Host | db | Command | Time | State | Info |
+----+-----------------+-----------+------+---------+------+-----------------------------------------------------------------------------+--------------------------------------------+
| 6 | event_scheduler | localhost | NULL | Daemon | 545 | Waiting on empty queue | NULL |
| 9 | root | localhost | bug | Sleep | 10 | | NULL |
| 15 | system user | | NULL | Connect | 44 | Waiting for master to send event | NULL |
| 16 | system user | | NULL | Connect | 5 | Slave has read all relay log; waiting for the slave I/O thread to update it | NULL |
| 17 | system user | | NULL | Connect | 44 | Waiting for an event from Coordinator | NULL |
| 18 | system user | | NULL | Connect | 44 | Waiting for an event from Coordinator | NULL |
| 19 | system user | | NULL | Connect | 5 | System lock | update repl_test set info='all' where id=1 |
| 20 | root | localhost | NULL | Query | 0 | init | show processlist |
+----+-----------------+-----------+------+---------+------+-----------------------------------------------------------------------------+--------------------------------------------+
8 rows in set (0.04 sec)
错误日志开始打印错误信息:
2020-08-20 11:01:00 22391 [Warning] Slave SQL: Worker 2 failed executing transaction '' at master log mysql-bin.000272, end_log_pos 654587717; Could not execute Update_rows event on table bug.repl_test; Lock wait timeout exceeded; try restarting transaction, Error_code: 1205; handler error HA_ERR_LOCK_WAIT_TIMEOUT; the event's master log FIRST, end_log_pos 654587717, Error_code: 1205
2020-08-20 11:01:00 22391 [ERROR] Slave SQL: ... The slave coordinator and worker threads are stopped, possibly leaving data in inconsistent state. A restart should restore consistency automatically, although using non-transactional storage for data or info tables or DDL queries could lead to problems. In such cases you have to examine your data (see documentation for details). Error_code: 1756
2020-08-20 11:01:00 22391 [Note] Error reading relay log event: slave SQL thread was killed
报的错与当时的内容一致。
mysql> show open tables where in_use >0 ;
+----------+-----------+--------+-------------+
| Database | Table | In_use | Name_locked |
+----------+-----------+--------+-------------+
| osdc | osdc_XXX | 1 | 0 |
+----------+-----------+--------+-------------+
1 row in set (0.00 sec)
全文完。
Enjoy MySQL :)
叶老师的「MySQL核心优化」大课已升级到MySQL 8.0,扫码开启MySQL 8.0修行之旅吧