mariadb_recovery is prone to data loss

Bug #1682153 reported by Sam Yaple
This bug affects 4 people
Affects: kolla-ansible
Status: Fix Released
Importance: Critical
Assigned to: Tudosoiu Marian
Milestone: (none shown)

Bug Description

While the mariadb_recovery playbook has never been the shining pinnacle of Galera recovery that it could be, it is in a pretty bad state right now.

In the past, the playbook's behaviour was "supply a recovery host, or it will recover the first host in the mariadb group (mariadb[0])". Now it tries to be a bit smarter and read grastate.dat properly, but it does not account for non-graceful shutdowns.

A patch added prior to Newton reads grastate.dat and parses it for the highest seqno. The data-loss scenario arises when you have shut down a node gracefully, some time passes, and then your cluster crashes. This isn't uncommon and shouldn't be dismissed. When that happens, the playbook will choose the old, gracefully shut-down node and stomp the data on the rest of the nodes. This is all done without user interaction and is exceedingly dangerous, since no backup is taken either.
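
To make the failure mode concrete, here is a minimal Python sketch of "pick the highest seqno from grastate.dat". It is not the playbook's actual code; the function names, host names, file contents and UUID are hypothetical, and the only fact relied on is that a node that did not shut down cleanly records `seqno: -1`.

```python
import re
from typing import Dict, Optional

# Illustrative grastate.dat contents (hypothetical UUID and seqno values).
GRACEFUL_OLD = """# GALERA saved state
version: 2.1
uuid:    6b1f5a0e-1111-2222-3333-444455556666
seqno:   1200
"""

CRASHED = """# GALERA saved state
version: 2.1
uuid:    6b1f5a0e-1111-2222-3333-444455556666
seqno:   -1
"""


def parse_seqno(grastate_text: str) -> Optional[int]:
    """Return the seqno found in grastate.dat text, or None if there is none."""
    match = re.search(r"^seqno:\s*(-?\d+)\s*$", grastate_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def naive_pick(states: Dict[str, str]) -> str:
    """Pick the host with the highest seqno, ignoring what -1 means."""
    def key(host: str) -> int:
        seqno = parse_seqno(states[host])
        return -1 if seqno is None else seqno
    return max(states, key=key)


# node1 was stopped gracefully long ago; node2 and node3 crashed later
# while holding newer data, so their files only say -1.
states = {"node1": GRACEFUL_OLD, "node2": CRASHED, "node3": CRASHED}
print(naive_pick(states))  # node1, the stale node
```

The stale node wins simply because 1200 is greater than -1, even though -1 means "unknown", not "older".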

The proper recovery method, which works automatically a good portion of the time, is as follows:

* Check if all mariadb nodes are stopped
    * if not stopped then do not recover
* Check if any mariadb nodes have gvwstate.dat
    * if gvwstate.dat found start *only* the nodes with gvwstate.dat without special options
        * This is not a guaranteed recovery, but it is a safe action (no data loss can occur)
* Check if any mariadb nodes have grastate.dat
    * If grastate.dat exists and has a seqno of -1 on any node, it is not safe to autorecover, abort
    * if no -1 exists on any node, bootstrap the node with the highest seqno
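
The checklist above can be sketched as a pure decision function. This is only an illustration in Python, not the playbook's implementation: `NodeState`, `Plan` and `plan_recovery` are hypothetical names, the node facts are assumed to have been gathered already, and aborting when no node has a grastate.dat is an assumption, since the list does not cover that case.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class NodeState:
    name: str
    running: bool
    has_gvwstate: bool
    seqno: Optional[int]  # None when the node has no grastate.dat


@dataclass
class Plan:
    action: str       # "abort", "start" or "bootstrap"
    nodes: List[str]  # nodes the action applies to
    reason: str


def plan_recovery(nodes: List[NodeState]) -> Plan:
    # Step 1: never recover while any node is still running.
    if any(n.running for n in nodes):
        return Plan("abort", [], "not all mariadb nodes are stopped")

    # Step 2: start only the nodes that have gvwstate.dat, with no
    # special options. This may not succeed, but it cannot lose data.
    with_gvw = [n.name for n in nodes if n.has_gvwstate]
    if with_gvw:
        return Plan("start", with_gvw, "gvwstate.dat found")

    # Step 3: fall back to grastate.dat.
    with_gra = [n for n in nodes if n.seqno is not None]
    if not with_gra:
        # Assumption: nothing to decide on, so leave it to the user.
        return Plan("abort", [], "no grastate.dat found")
    if any(n.seqno == -1 for n in with_gra):
        return Plan("abort", [], "seqno -1 found, unsafe to autorecover")

    # On a tie, max() keeps the first node in the list.
    best = max(with_gra, key=lambda n: n.seqno)
    return Plan("bootstrap", [best.name], "highest seqno")


# The data-loss scenario from this report now aborts instead of
# bootstrapping the stale node.
crashed = [
    NodeState("node1", False, False, 1200),
    NodeState("node2", False, False, -1),
    NodeState("node3", False, False, -1),
]
print(plan_recovery(crashed).action)  # abort
```

Keeping the decision separate from the actions makes it easy to test each failure scenario without touching a real cluster.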

This will cover a good chunk of the failure scenarios, including graceful shutdowns and full cluster outages, without any risk of data loss. If it cannot automatically recover, then the user should be *forced* to supply a bootstrap node on the command line based on whatever criteria they want (sometimes guessing, but guessing should be done by the user, not Kolla-Ansible).

Changed in kolla-ansible:
importance: Undecided → Critical
status: New → Confirmed
Duong Ha-Quang (duonghq)
Changed in kolla-ansible:
milestone: none → pike-2
Changed in kolla-ansible:
milestone: pike-2 → pike-3
Changed in kolla-ansible:
milestone: pike-3 → pike-rc1
zongyimin (yanpeifei) wrote :

I have run into this as well.

Changed in kolla-ansible:
milestone: pike-rc1 → pike-rc2
milestone: pike-rc2 → queens-1
Changed in kolla-ansible:
milestone: queens-2 → queens-3
Changed in kolla-ansible:
assignee: nobody → Tudosoiu Marian (mtudosoiu)
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to kolla-ansible (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/531115

Changed in kolla-ansible:
status: Confirmed → Fix Committed
OpenStack Infra (hudson-openstack) wrote : Change abandoned on kolla-ansible (master)

Change abandoned by Tudosoiu Marian (marian.tudosoiu@1and1.ro) on branch: master
Review: https://review.openstack.org/531115

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to kolla-ansible (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/531122

OpenStack Infra (hudson-openstack) wrote : Fix proposed to kolla-ansible (master)

Fix proposed to branch: master
Review: https://review.openstack.org/532509

OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/532515

OpenStack Infra (hudson-openstack) wrote : Change abandoned on kolla-ansible (master)

Change abandoned by Tudosoiu Marian (marian.tudosoiu@1and1.ro) on branch: master
Review: https://review.openstack.org/532509

OpenStack Infra (hudson-openstack) wrote :

Change abandoned by Tudosoiu Marian (marian.tudosoiu@1and1.ro) on branch: master
Review: https://review.openstack.org/532515

OpenStack Infra (hudson-openstack) wrote : Fix merged to kolla-ansible (master)

Reviewed: https://review.openstack.org/531122
Committed: https://git.openstack.org/cgit/openstack/kolla-ansible/commit/?id=cead8ec6235bfd4f8d36efcbab1f3d0a288cbd96
Submitter: Zuul
Branch: master

commit cead8ec6235bfd4f8d36efcbab1f3d0a288cbd96
Author: Marian Tudosoiu <marian.tudosoiu@1and1.ro>
Date: Thu Jan 4 12:32:12 2018 +0200

    Rework mariadb recovery tasks

    In recover_cluster.yaml playbook the task to find the highest
    seqno/Global Transaction ID is no longer relying only on grastate.dat
    Instead it now follows the recommendations from galera cluster website
    http://galeracluster.com/documentation-webpages/restartingcluster.html

    Closes-Bug: 1682153

    Change-Id: I5fc3eaa8baee659576c4c39aef9cfd351c8e9af7
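
As a hedged illustration of what "no longer relying only on grastate.dat" can mean: Galera's restart guidance describes starting mysqld with `--wsrep-recover` so that it logs the last committed position. The Python sketch below only parses such a log line. It is not taken from the merged change, the helper name and sample log are hypothetical, and the `Recovered position: <uuid>:<seqno>` format is assumed from the Galera documentation.

```python
import re
from typing import Optional, Tuple

# Matches "Recovered position: <36-char uuid>:<seqno>" anywhere in a line.
_RECOVERED = re.compile(r"Recovered position:\s*([0-9a-fA-F-]{36}):(-?\d+)")


def parse_recovered_position(log_text: str) -> Optional[Tuple[str, int]]:
    """Return (uuid, seqno) from the last matching log line, or None."""
    matches = _RECOVERED.findall(log_text)
    if not matches:
        return None
    uuid, seqno = matches[-1]
    return uuid, int(seqno)


# Hypothetical log excerpt with a made-up UUID and seqno.
sample_log = (
    "2018-01-04 12:00:00 [Note] WSREP: Found saved state: "
    "6b1f5a0e-1111-2222-3333-444455556666:-1\n"
    "2018-01-04 12:00:01 [Note] WSREP: Recovered position: "
    "6b1f5a0e-1111-2222-3333-444455556666:1742\n"
)
print(parse_recovered_position(sample_log))
```

A position recovered this way reflects what the storage engine actually committed, which is why it remains usable after a crash that left `seqno: -1` in grastate.dat.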

Changed in kolla-ansible:
status: Fix Committed → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to kolla-ansible (stable/pike)

Fix proposed to branch: stable/pike
Review: https://review.openstack.org/539632

OpenStack Infra (hudson-openstack) wrote : Related fix merged to kolla-ansible (master)

Reviewed: https://review.openstack.org/539628
Committed: https://git.openstack.org/cgit/openstack/kolla-ansible/commit/?id=465bc9ee1c9324ba95b223f7b8d11bc0cd376608
Submitter: Zuul
Branch: master

commit 465bc9ee1c9324ba95b223f7b8d11bc0cd376608
Author: Alexandru Bogdan Pica <alexandru.pica@1and1.ro>
Date: Wed Jan 31 20:27:38 2018 +0200

    Improve mariadb_recovery

    The purpose of this change is to improve upon
    https://review.openstack.org/#/c/531122/

    - Moved vars inside the defaults/main.yml file
    - Made the regex for the lineinfile safer

    Change-Id: Id581c0b36f3d4bd61d3627b8364b79296b967387
    Closes-Bug: 1746567
    Related-Bug: 1682153

OpenStack Infra (hudson-openstack) wrote : Fix merged to kolla-ansible (stable/pike)

Reviewed: https://review.openstack.org/539632
Committed: https://git.openstack.org/cgit/openstack/kolla-ansible/commit/?id=283544ebd6c991a9adb4e3e68292b5b40ab8f55e
Submitter: Zuul
Branch: stable/pike

commit 283544ebd6c991a9adb4e3e68292b5b40ab8f55e
Author: Marian Tudosoiu <marian.tudosoiu@1and1.ro>
Date: Thu Jan 4 12:32:12 2018 +0200

    Rework mariadb recovery tasks

    In recover_cluster.yaml playbook the task to find the highest
    seqno/Global Transaction ID is no longer relying only on grastate.dat
    Instead it now follows the recommendations from galera cluster website
    http://galeracluster.com/documentation-webpages/restartingcluster.html

    Closes-Bug: 1682153

    Change-Id: I5fc3eaa8baee659576c4c39aef9cfd351c8e9af7
    (cherry picked from commit cead8ec6235bfd4f8d36efcbab1f3d0a288cbd96)

tags: added: in-stable-pike
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to kolla-ansible (stable/pike)

Related fix proposed to branch: stable/pike
Review: https://review.openstack.org/540940

OpenStack Infra (hudson-openstack) wrote : Related fix merged to kolla-ansible (stable/pike)

Reviewed: https://review.openstack.org/540940
Committed: https://git.openstack.org/cgit/openstack/kolla-ansible/commit/?id=37fa7fcfb06bffffe9b0f0f801ec5d32c470c4d5
Submitter: Zuul
Branch: stable/pike

commit 37fa7fcfb06bffffe9b0f0f801ec5d32c470c4d5
Author: Alexandru Bogdan Pica <alexandru.pica@1and1.ro>
Date: Wed Jan 31 20:27:38 2018 +0200

    Improve mariadb_recovery

    The purpose of this change is to improve upon
    https://review.openstack.org/#/c/531122/

    - Moved vars inside the defaults/main.yml file
    - Made the regex for the lineinfile safer

    Change-Id: Id581c0b36f3d4bd61d3627b8364b79296b967387
    Closes-Bug: 1746567
    Related-Bug: 1682153
    (cherry picked from commit 465bc9ee1c9324ba95b223f7b8d11bc0cd376608)

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/kolla-ansible 6.0.0.0rc1

This issue was fixed in the openstack/kolla-ansible 6.0.0.0rc1 release candidate.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/kolla-ansible 5.0.2

This issue was fixed in the openstack/kolla-ansible 5.0.2 release.
