pt-table-checksum deadlock

Bug #1287253 reported by Daniel Durand
16
This bug affects 2 people
Affects Status Importance Assigned to Milestone
Percona Toolkit moved to https://jira.percona.com/projects/PT
Fix Released
Medium
Frank Cizmich

Bug Description

Percona Toolkik 2.2.6

We currently using pt-table-checksum to check data integrity on all the oltp cluster, we have 6 servers there with 8k qps and we start to get deadlocks on one of the biggest tables when this tool try to get the checksum on some chunks. I already drop chunk time, but it doesn't fix completly the problem :-/.

Command Line used:

pt-table-checksum --host=sql1v-ci.ci.prd.******** --user=pt_checksum --password=******* --nocheck-binlog-format --check-interval=1 --check-plan --check-replication-filters --check-slave-tables --chunk-size-limit=1 --chunk-time="0.025" --create-replicate-table --databases="tewn" --tables="landing_page_analytics" --empty-replicate-table --max-lag="1s" --max-load="Threads_running=40" --progress="time,30" --replicate="pt_data.checksums" --retries=5

Checksumming tewn.landing_page_analytics: 1% 25:28 remain
Checksumming tewn.landing_page_analytics: 4% 22:42 remain
Checksumming tewn.landing_page_analytics: 6% 20:59 remain
Checksumming tewn.landing_page_analytics: 8% 20:35 remain
Checksumming tewn.landing_page_analytics: 10% 21:59 remain
Checksumming tewn.landing_page_analytics: 12% 20:42 remain
Checksumming tewn.landing_page_analytics: 14% 20:39 remain
Checksumming tewn.landing_page_analytics: 16% 19:50 remain
Checksumming tewn.landing_page_analytics: 18% 20:04 remain
Checksumming tewn.landing_page_analytics: 19% 21:14 remain
Checksumming tewn.landing_page_analytics: 21% 20:21 remain
Checksumming tewn.landing_page_analytics: 23% 19:27 remain
Checksumming tewn.landing_page_analytics: 25% 18:53 remain
Checksumming tewn.landing_page_analytics: 28% 18:15 remain
Checksumming tewn.landing_page_analytics: 30% 17:41 remain
Checksumming tewn.landing_page_analytics: 32% 16:52 remain
Checksumming tewn.landing_page_analytics: 34% 16:30 remain
Checksumming tewn.landing_page_analytics: 36% 15:42 remain
Checksumming tewn.landing_page_analytics: 39% 15:29 remain
Checksumming tewn.landing_page_analytics: 41% 14:53 remain
Checksumming tewn.landing_page_analytics: 43% 14:15 remain
Checksumming tewn.landing_page_analytics: 45% 13:33 remain
Checksumming tewn.landing_page_analytics: 48% 12:50 remain
Checksumming tewn.landing_page_analytics: 50% 12:26 remain
Checksumming tewn.landing_page_analytics: 52% 11:43 remain
Checksumming tewn.landing_page_analytics: 54% 11:15 remain
Checksumming tewn.landing_page_analytics: 55% 11:04 remain
Checksumming tewn.landing_page_analytics: 57% 10:53 remain
02-28T15:30:44 Error checksumming table tewn.landing_page_analytics: Error executing checksum query: DBD::mysql::st execute failed: Deadlock found when trying to get lock; try restarting transaction [for Statement "REPLACE INTO `pt_data`.`checksums` (db, tbl, chunk, chunk_index, lower_boundary, upper_boundary, this_cnt, this_crc) SELECT ?, ?, ?, ?, ?, ?, COUNT(*) AS cnt, COALESCE(LOWER(CONV(BIT_XOR(CAST(CRC32(CONCAT_WS('#', `id`, `domain_bitfield`, `page_id`, `category`, `demographic_sex`, `demographic_age`, `content_rating`, `num_ctr`, `num_display`, `weight`, `weight_vw`, `created`, `visible`, `type`, `key`, `updated` + 0, CONCAT(ISNULL(`domain_bitfield`), ISNULL(`page_id`), ISNULL(`num_ctr`), ISNULL(`num_display`), ISNULL(`weight`), ISNULL(`created`)))) AS UNSIGNED)), 10, 16)), 0) AS crc FROM `tewn`.`landing_page_analytics` FORCE INDEX(`PRIMARY`) WHERE ((`id` >= ?)) AND ((`id` <= ?)) /*checksum chunk*/" with ParamValues: 0='tewn', 1='landing_page_analytics', 2=16297, 3='PRIMARY', 4='159650735', 5='159662859', 6='159650735', 7='159662859'] at /usr/bin/pt-table-checksum line 10459.

            TS ERRORS DIFFS ROWS CHUNKS SKIPPED TIME TABLE
02-28T15:30:44 1 0 80558617 16297 0 886.019 tewn.landing_page_analytics

May suggest something, I was checking this script and MySQL suggest when you have this kind of deadlock you must to retry to execute again this sentence, but in pt-table-checksum script line 10510 we have "fail" function which is evaluating kind of error so, retry argument option for the tool will retry to execute the last sql statement, like I said MySQL suggest to retry to execute, so with this modification this could be fixed.

      fail => sub {
         my (%args) = @_;
         my $error = $args{error};

         if ( $error =~ m/Lock wait timeout exceeded/
             || $error =~ m/Query execution was interrupted/
             || $error =~ m/Deadlock found when trying to get lock/
         ) {
             return 1;
         }

Related branches

Revision history for this message
Daniel Durand (daniel-durand) wrote :
Revision history for this message
Daniel Durand (daniel-durand) wrote :

Patch for this problem

Changed in percona-toolkit:
milestone: none → 2.2.10
status: New → Fix Committed
assignee: nobody → Frank Cizmich (frank-cizmich)
importance: Undecided → Medium
Revision history for this message
Frank Cizmich (frank-cizmich) wrote :

Fixed using Durand's patch.

Revision history for this message
Daniel Durand (daniel-durand) wrote :

Thank you Frank, I saw this problem is also happening with the other tools, it should have the same condition when this error happens " || $error =~ m/Deadlock found when trying to get lock/ "

Revision history for this message
Frank Cizmich (frank-cizmich) wrote :

Hi Daniel D,

Can you point out which other tools miss this, and what version of PT are you using?
It seems to be already covered in pt-online-schema-change and others in the current version.
By the way , pt-online-schema-change matches the shortened /Deadlock found/ , so we shortened it in pt-table-checksum too.

Regards

Revision history for this message
Frank Cizmich (frank-cizmich) wrote :

Sorry, I now see you mentioned 2.2.6 right of the bat.

Still, pt-osc, pt-archiver and pt-heartbeat catch that error in that version.

Revision history for this message
Daniel Durand (daniel-durand) wrote : Re: [Bug 1287253] Re: pt-table-checksum deadlock
Download full text (5.5 KiB)

Hi Frank, let me check to be sure which one I saw it, what I remember it
was pt_table_sync but let me take a look, don't remember in this moment
when I check this was 3 or 4 months ago... thanks

2014-07-30 9:22 GMT-07:00 Frank Cizmich <email address hidden>:

> Sorry, I now see you mentioned 2.2.6 right of the bat.
>
> Still, pt-osc, pt-archiver and pt-heartbeat catch that error in that
> version.
>
> --
> You received this bug notification because you are subscribed to the bug
> report.
> https://bugs.launchpad.net/bugs/1287253
>
> Title:
> pt-table-checksum deadlock
>
> Status in Percona Toolkit:
> Fix Committed
>
> Bug description:
> Percona Toolkik 2.2.6
>
> We currently using pt-table-checksum to check data integrity on all
> the oltp cluster, we have 6 servers there with 8k qps and we start to
> get deadlocks on one of the biggest tables when this tool try to get
> the checksum on some chunks. I already drop chunk time, but it doesn't
> fix completly the problem :-/.
>
> Command Line used:
>
> pt-table-checksum --host=sql1v-ci.ci.prd.********
> --user=pt_checksum --password=******* --nocheck-binlog-format
> --check-interval=1 --check-plan --check-replication-filters
> --check-slave-tables --chunk-size-limit=1 --chunk-time="0.025"
> --create-replicate-table --databases="tewn"
> --tables="landing_page_analytics" --empty-replicate-table --max-
> lag="1s" --max-load="Threads_running=40" --progress="time,30"
> --replicate="pt_data.checksums" --retries=5
>
> Checksumming tewn.landing_page_analytics: 1% 25:28 remain
> Checksumming tewn.landing_page_analytics: 4% 22:42 remain
> Checksumming tewn.landing_page_analytics: 6% 20:59 remain
> Checksumming tewn.landing_page_analytics: 8% 20:35 remain
> Checksumming tewn.landing_page_analytics: 10% 21:59 remain
> Checksumming tewn.landing_page_analytics: 12% 20:42 remain
> Checksumming tewn.landing_page_analytics: 14% 20:39 remain
> Checksumming tewn.landing_page_analytics: 16% 19:50 remain
> Checksumming tewn.landing_page_analytics: 18% 20:04 remain
> Checksumming tewn.landing_page_analytics: 19% 21:14 remain
> Checksumming tewn.landing_page_analytics: 21% 20:21 remain
> Checksumming tewn.landing_page_analytics: 23% 19:27 remain
> Checksumming tewn.landing_page_analytics: 25% 18:53 remain
> Checksumming tewn.landing_page_analytics: 28% 18:15 remain
> Checksumming tewn.landing_page_analytics: 30% 17:41 remain
> Checksumming tewn.landing_page_analytics: 32% 16:52 remain
> Checksumming tewn.landing_page_analytics: 34% 16:30 remain
> Checksumming tewn.landing_page_analytics: 36% 15:42 remain
> Checksumming tewn.landing_page_analytics: 39% 15:29 remain
> Checksumming tewn.landing_page_analytics: 41% 14:53 remain
> Checksumming tewn.landing_page_analytics: 43% 14:15 remain
> Checksumming tewn.landing_page_analytics: 45% 13:33 remain
> Checksumming tewn.landing_page_analytics: 48% 12:50 remain
> Checksumming tewn.landing_page_analytics: 50% 12:26 remain
> Checksumming tewn.landing_page_analytics: 52% 11:43 remain
> Checksumming tewn.landing_pag...

Read more...

Changed in percona-toolkit:
status: Fix Committed → Fix Released
Revision history for this message
Shahriyar Rzayev (rzayev-sehriyar) wrote :

Percona now uses JIRA for bug reports so this bug report is migrated to: https://jira.percona.com/browse/PT-641

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.