pt-table-checksum doesn't reconnect the slave $dbh

Bug #1042727 reported by Baron Schwartz
This bug affects 4 people
Affects: Percona Toolkit (moved to https://jira.percona.com/projects/PT)
Status: Fix Released
Importance: High
Assigned to: Daniel Nichter

Bug Description

When replication is very delayed, pt-table-checksum will not keep its connection to the replica [was:master] alive, and when the replica catches up or if it dies for some reason, we get an error. It looks like this:

================

08-27T09:44:10 Error waiting for the last checksum of table <...> to replicate to replica <...>: DBD::mysql::db selectrow_array failed: MySQL server has gone away [for Statement "SELECT MAX(chunk) FROM `percona`.`checksum` WHERE ... at pt-table-checksum line 8581.

Check that the replica is running and has the replicate table `percona`.`checksum`. Checking the replica for checksum differences will probably cause another error.
08-27T09:44:10 Error checking for checksum differences of table <...> on replica <...>: DBD::mysql::db selectall_arrayref failed: MySQL server has gone away [for Statement "SELECT CONCAT(db, '.', tbl) AS `table`, chunk, chunk_index, lower_boundary, upper_boundary, COALESCE(this_cnt-master_cnt, 0) AS cnt_diff, COALESCE(this_crc <> master_crc OR ISNULL(master_crc) <> ISNULL(this_crc), 0) AS crc_diff, this_cnt, master_cnt, this_crc, master_crc FROM `rkdb`.`archivechecksum` WHERE (master_cnt <> this_cnt OR master_crc <> this_crc OR ISNULL(master_crc) <> ISNULL(this_crc)) AND (db='...' AND tbl='...')"] at pt-table-checksum line 4118.

Check that the replica is running and has the replicate table `percona`.`checksum`.

================

I think the tool needs to reconnect to replicas.

[redacted: I think the tool needs to do a keepalive SELECT 1 or something like that.]
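One way to picture the reconnect idea is a small guard that pings the handle before use and opens a new connection when the ping fails. This is only a sketch, not the fix that was released: `live_dbh`, the `$connect` callback, and the `MockDbh` class are hypothetical names, and the mock stands in for a real DBI handle so the example runs without a MySQL server. The only DBI method assumed is `ping`.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of pt-table-checksum: return a live handle,
# reconnecting through the caller-supplied $connect coderef if ping() fails.
sub live_dbh {
   my ( $dbh, $connect ) = @_;
   return $dbh if $dbh && eval { $dbh->ping };
   return $connect->();   # e.g. sub { DBI->connect($dsn, $user, $pass, \%attrs) }
}

# Stand-in for a DBI handle so the sketch runs without a MySQL server.
{
   package MockDbh;
   sub new  { my ( $class, %args ) = @_; return bless {%args}, $class }
   sub ping { return $_[0]->{alive} }
}

my $connects = 0;
my $connect  = sub { $connects++; return MockDbh->new( alive => 1 ) };

my $dead = MockDbh->new( alive => 0 );    # simulates a connection that has gone away
my $dbh  = live_dbh( $dead, $connect );   # ping fails, so a new handle is opened
my $same = live_dbh( $dbh,  $connect );   # ping succeeds, so the handle is reused

print "reconnects: $connects\n";
```

Calling such a guard right before each query on a replica would mean a stale handle is replaced rather than reused, at the cost of one extra round trip per check.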

Brian Fraser (fraserbn) wrote :

I wonder what would happen if, instead of keeping the connection alive, we used $dbh->{mysql_auto_reconnect} = 1. Does anyone have any experience with that?
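For context, `mysql_auto_reconnect` is a DBD::mysql handle attribute, and the DBD::mysql documentation notes that it is ignored while AutoCommit is off. A minimal sketch of requesting it only when AutoCommit is on might look like the following; `connect_attrs` and the default values are hypothetical and not taken from pt-table-checksum.

```perl
use strict;
use warnings;

# Hypothetical helper: build the attribute hash passed as the fourth
# argument to DBI->connect(). mysql_auto_reconnect is only requested
# when AutoCommit is on, since DBD::mysql ignores it otherwise.
sub connect_attrs {
   my (%opts) = @_;
   my %attrs = (
      AutoCommit => 0,   # assumed defaults, for illustration only
      RaiseError => 1,
      PrintError => 0,
      %opts,             # caller-supplied options override the defaults
   );
   $attrs{mysql_auto_reconnect} = 1 if $attrs{AutoCommit};
   return \%attrs;
}

my $replica_attrs = connect_attrs( AutoCommit => 1 );   # auto-reconnect requested
my $txn_attrs     = connect_attrs();                    # AutoCommit off, not requested

# With a real server this would be used as:
# my $dbh = DBI->connect( $dsn, $user, $pass, $replica_attrs );
```

Note that a reconnected session starts fresh on the server, so session variables and temporary tables from the old connection are gone.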

Changed in percona-toolkit:
status: New → Confirmed
Baron Schwartz (baron-xaprb) wrote :

I am skeptical. Statement handles would be invalidated, I assume. But it may work.

In the meantime I am changing my local copy to do two things:

1. Don't print those warnings if --quiet =1
2. Wrap "$diffs = $rc->find_replication_differences(...)" in an eval{} block so that the whole thing doesn't get aborted if only one slave's connection has died.
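The two local changes above can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual pt-table-checksum code: `find_diffs` is a hypothetical stand-in for `$rc->find_replication_differences(...)`, and `$quiet` and the replica hashes are made up so the example runs without a database.

```perl
use strict;
use warnings;

# Hypothetical stand-in for $rc->find_replication_differences(...): dies for
# a replica whose connection is gone, otherwise returns an arrayref of diffs.
sub find_diffs {
   my ($replica) = @_;
   die "MySQL server has gone away\n" unless $replica->{alive};
   return $replica->{diffs};
}

my $quiet    = 0;   # hypothetical flag standing in for --quiet
my @replicas = (
   { name => 'replica1', alive => 1, diffs => [] },
   { name => 'replica2', alive => 0 },
   { name => 'replica3', alive => 1, diffs => [ { chunk => 4 } ] },
);

my ( %diffs_for, @failed );
for my $replica (@replicas) {
   my $diffs;
   my $ok = eval { $diffs = find_diffs($replica); 1 };
   if ( !$ok ) {
      # Record the failure and keep going with the remaining replicas.
      my $err = $@ || 'unknown error';
      chomp $err;
      warn "Skipping $replica->{name}: $err\n" unless $quiet;
      push @failed, $replica->{name};
      next;
   }
   $diffs_for{ $replica->{name} } = $diffs;
}

print "checked: ", join( ', ', sort keys %diffs_for ), "\n";
print "failed: @failed\n";
```

With the eval scoped to a single replica, one dead connection is reported and skipped while the healthy replicas are still checked.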

Baron Schwartz (baron-xaprb) wrote :

By the way, it seems that every time I get the above messages, it's because checking on one slave failed. The tool then aborts checksumming and/or never checks anything on that replica again, and before exiting it tries to check for differences using a $dbh it has been ignoring because it was dead. I never get just one of the two error messages; I always get both.

Baron Schwartz (baron-xaprb) wrote :

I'm trying this to see what happens. I'll let you know:

Index: utils/pt/pt-table-checksum
===================================================================
--- utils/pt/pt-table-checksum (revision 29726)
+++ utils/pt/pt-table-checksum (working copy)
@@ -216,6 +216,9 @@
       mysql_enable_utf8 => ($cxn_string =~ m/charset=utf8/i ? 1 : 0),
    };
    @{$defaults}{ keys %$opts } = values %$opts;
+   if ( $defaults->{AutoCommit} ) {
+      $defaults->{mysql_auto_reconnect} = 1;
+   }

    if ( $opts->{mysql_use_result} ) {
       $defaults->{mysql_use_result} = 1;

summary: - pt-table-checksum doesn't keep the master DBH alive
+ pt-table-checksum doesn't reconnect the slave $dbh
description: updated
Daniel Nichter (daniel-nichter) wrote :

Another case of connection resiliency à la bug 1046966.

tags: added: error-recovery
Tibor Korocz (tkorocz) wrote :

Hi,

I'm using the newest pt-table-checksum but I got the same error:

11-18T18:09:35 Error waiting for the last checksum of table db.tbl to replicate to replica HOST : DBD::mysql::db selectrow_array failed: MySQL server has gone away [for Statement "SELECT MAX(chunk) FROM `db`.`checksums` WHERE db='db' AND tbl='tbl' AND master_crc IS NOT NULL"] at /usr/bin/pt-table-checksum line 11230.

Check that the replica is running and has the replicate table `db`.`checksums`. Checking the replica for checksum differences will probably cause another error.

Does anybody have a solution for this?

Thanks.

Changed in percona-toolkit:
status: Confirmed → In Progress
assignee: nobody → Daniel Nichter (daniel-nichter)
importance: Undecided → High
Daniel Nichter (daniel-nichter) wrote :
Changed in percona-toolkit:
status: In Progress → Fix Committed
Changed in percona-toolkit:
milestone: none → 2.2.15
Changed in percona-toolkit:
status: Fix Committed → Fix Released
Changed in percona-toolkit:
importance: High → Medium
importance: Medium → High
Shahriyar Rzayev (rzayev-sehriyar) wrote :

Percona now uses JIRA for bug reports so this bug report is migrated to: https://jira.percona.com/browse/PT-329
