Reconstructor does not restore a fragment to a handoff node

Bug #1510342 reported by Naoto Nishizono
This bug affects 2 people
Affects: OpenStack Object Storage (swift)
Status: Fix Released
Importance: Medium
Assigned to: Unassigned
Milestone: —

Bug Description

In the following case, unlike the replicator, the reconstructor does not restore a fragment to a handoff node in 2.5.0.
Is this a bug, intended behavior, or something still under development?

1. PUT an object
2. Unmount the device of a primary node
3. Run the reconstructor
4. The fragment that was on the device unmounted in step 2 is not restored to a handoff node

description: updated
description: updated
Revision history for this message
clayg (clay-gerrard) wrote :

yeah, I think rebuilding (on 507 *only*) would probably be more consistent with the handling of replicated objects.
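
A minimal sketch (plain Python, not Swift's reconstructor code) of the check being suggested: treat only an explicit 507 Insufficient Storage from a primary's REPLICATE response as "disk unmounted" and rebuild that fragment to a handoff, while other failures stay transient. The helper name is made up for illustration.

    from http import HTTPStatus

    def should_rebuild_to_handoff(replicate_status):
        # Timeouts, 5xx server errors, etc. are treated as transient; only an
        # explicit 507 from the REPLICATE request marks the disk as unmounted.
        return replicate_status == HTTPStatus.INSUFFICIENT_STORAGE  # 507

    if __name__ == '__main__':
        for status in (200, 404, 500, 507):
            print(status, should_rebuild_to_handoff(status))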

Changed in swift:
importance: Undecided → High
clayg (clay-gerrard)
tags: added: ec
Bill Huber (wbhuber)
Changed in swift:
status: New → Confirmed
Bill Huber (wbhuber)
Changed in swift:
assignee: nobody → Bill Huber (wbhuber)
Revision history for this message
funny_falcon (funny-falcon) wrote :

Was it fixed in a new release?

Our company needs erasure-coded storage, and Swift looks to be less hardware-hungry than Ceph
(at least judging by the architecture overview; I have only tested Ceph so far).
But such a serious issue is a show-stopper for me.

Revision history for this message
clayg (clay-gerrard) wrote :

The issue is still open. The current mitigation expects the operator to monitor for unmounted disks and make a ring change. I'll add this bug to the design session notes for Austin and we'll work it into prioritization with the other EC work.

Revision history for this message
funny_falcon (funny-falcon) wrote :

So is this the intended mode of operation? And is it not tied to the erasure-coded storage policy?

Revision history for this message
funny_falcon (funny-falcon) wrote :

It looks like Ceph does almost the same thing:
- it doesn't start rebalancing until a disk is marked as "out".
A disk can be marked as "out" either automatically or by the operator.

So maybe this issue's subject is not really an issue, and another issue
should be opened instead:
- "add a configurable automatic ring change on disk failure"?

If this issue is not tied to erasure coding at all, then it is unfortunate
to see it listed here, because it gives the impression that EC is unstable.

Revision history for this message
clayg (clay-gerrard) wrote :

I've changed the priority of this issue - EC works differently than replication in this regard, but mirroring the replicated behavior would require some design work, and it's not a high priority compared to other things that should be improved in the EC reconstructor.

For now EC does not reconstruct fragments when a primary node responds as unmounted - a fail-in-place strategy requires a ring change (unlike replicated, which just requires an unmount).

Maybe this bug should be reworded as a documentation or runbook issue.

Changed in swift:
importance: High → Medium
importance: Medium → Low
Revision history for this message
Tim Burke (1-tim-z) wrote :

To provide a little more background: there's a reason "EC works differently than Replication in this regard" and we *don't* want to just blindly mirror replication. In order to over-replicate, we just have to talk to one disk: the local disk. If we wanted to over-reconstruct, we'd be impacting ec_num_data disks, which could have ripple effects in the cluster.

That said, it's definitely *not good* that failing to deal with unmounted disks can (and eventually *will*) lead to data loss. It'd be *way better* if Swift could get us back up to full durability without a ring change.

So, idea: Have the reconstructor add an xattr like user.swift.sync_times -- have it be a dict of frag index to last-sync timestamp. If we get a 507 *and* our recorded last-sync for that index is more than a day (or some configurable amount) in the past, reconstruct to a hand-off. Maybe have another config opt for whether or not to reconstruct when there's no last-sync, so we don't trigger a reconstruction storm on upgrade?
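
A rough sketch of this proposal (not code from Swift): record a per-fragment-index "last synced" timestamp in an xattr on the object's .data file, and only allow a rebuild to a handoff once that record is older than a configurable threshold. The xattr name comes from the comment above; the helper names, the one-day default, and the rebuild_on_missing knob are illustrative assumptions.

    import json
    import os
    import time

    XATTR_NAME = 'user.swift.sync_times'   # name suggested above
    REBUILD_AFTER = 24 * 60 * 60           # hypothetical default: one day

    def load_sync_times(path):
        try:
            return json.loads(os.getxattr(path, XATTR_NAME))
        except (OSError, ValueError):
            return {}

    def record_sync(path, frag_index):
        # Called whenever a sync for this fragment index succeeds.
        times = load_sync_times(path)
        times[str(frag_index)] = time.time()
        os.setxattr(path, XATTR_NAME, json.dumps(times).encode('ascii'))

    def overdue_for_rebuild(path, frag_index, rebuild_on_missing=False):
        # True if the 507'ing fragment hasn't synced within REBUILD_AFTER;
        # rebuild_on_missing guards against a reconstruction storm on upgrade.
        last = load_sync_times(path).get(str(frag_index))
        if last is None:
            return rebuild_on_missing
        return time.time() - last > REBUILD_AFTER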

Revision history for this message
John Dickinson (notmyname) wrote :

Tim, I like this idea. I'm not a fan of the "just leave it and eventually lose data" current state of things, especially when replication has very different semantics for operators. Your proposed "do it automatically, but only after a time" sounds like a good compromise. Probably not a bad idea to log a warning (?) on startup if the config trigger turns it off.
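
A tiny illustration of that startup warning (the option name and helper are hypothetical, not an actual Swift config setting):

    import logging

    def warn_if_auto_rebuild_disabled(conf, logger=None):
        # Hypothetical 'auto_rebuild_on_unmounted' option: if an operator turns
        # the automatic behavior off, say so loudly at startup.
        logger = logger or logging.getLogger('object-reconstructor')
        if not conf.get('auto_rebuild_on_unmounted', True):
            logger.warning('Rebuild-to-handoff for unmounted primaries is '
                           'disabled; unmounted disks will remain under-durable '
                           'until a ring change is made.')

    if __name__ == '__main__':
        logging.basicConfig(level=logging.WARNING)
        warn_if_auto_rebuild_disabled({'auto_rebuild_on_unmounted': False})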

Changed in swift:
assignee: Bill Huber (wbhuber) → nobody
Revision history for this message
clayg (clay-gerrard) wrote :

I don't think there's any reason to add a delay; if we want EC to support a fail-in-place strategy we should just rebuild when we get the 507.

The question is really just *where* to rebuild (imagine multiple unmounted disks; we don't want two frags rebuilt to the first handoff).

Eventually someone will replace the drive, and if there's a ring change they'll probably flop over to rebalance mode and have a good chance of the rebuilt fragment from the handoff making its way over to the new primary instead of being rebuilt again.
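
An illustrative sketch (not the merged implementation) of the "*where* to rebuild" question: spread rebuilt fragments across the handoff list deterministically by fragment index, so two unmounted primaries don't both land on the first handoff node.

    def pick_handoff(handoff_nodes, frag_index):
        # Offsetting into the handoff list by fragment index keeps the choice
        # stable across reconstructor passes and distinct per failed primary.
        if not handoff_nodes:
            raise ValueError('no handoff nodes available')
        return handoff_nodes[frag_index % len(handoff_nodes)]

    if __name__ == '__main__':
        handoffs = ['h0', 'h1', 'h2', 'h3']
        for frag_index in (2, 5):   # two different unmounted primaries
            print(frag_index, pick_handoff(handoffs, frag_index))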

Changed in swift:
importance: Low → Medium
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to swift (master)

Reviewed: https://review.openstack.org/629056
Committed: https://git.openstack.org/cgit/openstack/swift/commit/?id=ea8e545a27f06868323ff91c1584d18ab9ac6cda
Submitter: Zuul
Branch: master

commit ea8e545a27f06868323ff91c1584d18ab9ac6cda
Author: Clay Gerrard <email address hidden>
Date: Mon Feb 4 15:46:40 2019 -0600

    Rebuild frags for unmounted disks

    Change the behavior of the EC reconstructor to perform a fragment
    rebuild to a handoff node when a primary peer responds with 507 to the
    REPLICATE request.

    Each primary node in an EC ring will sync with exactly three primary
    peers: in addition to the left & right nodes we now select a third node
    from the far side of the ring. If any of these partners respond
    unmounted, the reconstructor will rebuild its fragments to a handoff
    node with the appropriate index.

    To prevent ssync (which is uninterruptible) receiving a 409 (Conflict)
    we must give the remote handoff node the correct backend_index for the
    fragments it will receive. In the common case we will use
    deterministically different handoffs for each fragment index to prevent
    multiple unmounted primary disks from forcing a single handoff node to
    hold more than one rebuilt fragment.

    Handoff nodes will continue to attempt to revert rebuilt handoff
    fragments to the appropriate primary until it is remounted or
    rebalanced. After a rebalance of EC rings (potentially removing
    unmounted/failed devices), it's most IO efficient to run in
    handoffs_only mode to avoid unnecessary rebuilds.

    Closes-Bug: #1510342

    Change-Id: Ief44ed39d97f65e4270bf73051da9a2dd0ddbaec
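
A simplified sketch of the partner-selection rule described above: each primary syncs with its left and right neighbors plus one node from roughly the far side of the ring. This only mirrors the commit message's description; the actual logic lives in Swift's object reconstructor (swift/obj/reconstructor.py).

    def get_sync_partners(primary_nodes, my_index):
        # Return the three primary peers this node syncs with for a partition.
        n = len(primary_nodes)
        left = primary_nodes[(my_index - 1) % n]
        right = primary_nodes[(my_index + 1) % n]
        far = primary_nodes[(my_index + n // 2) % n]   # opposite side of the ring
        return [left, right, far]

    if __name__ == '__main__':
        nodes = list(range(8))                          # stand-in for an 8-node EC set
        print(get_sync_partners(nodes, my_index=0))     # -> [7, 1, 4]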

Changed in swift:
status: Confirmed → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to swift (feature/losf)

Fix proposed to branch: feature/losf
Review: https://review.openstack.org/637142

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to swift (feature/losf)

Reviewed: https://review.openstack.org/637142
Committed: https://git.openstack.org/cgit/openstack/swift/commit/?id=b68bef5cd80e2a1b71a1ef544e122b39dfa7ac57
Submitter: Zuul
Branch: feature/losf

commit 926a024135d380999d9f8494b19b59bb87a7f5b6
Author: Tim Burke <email address hidden>
Date: Thu Feb 14 21:02:01 2019 +0000

    Fix up flakey TestContainer.test_PUT_bad_metadata

    Change-Id: I7489f2bb95c27d1ddd5e8fa7e5786904100fb567

commit 002d21991e100ee6199e79679ae990c96ea05730
Author: Tim Burke <email address hidden>
Date: Wed Feb 13 17:02:08 2019 +0000

    Make get_data/async/tmp_dir explicit

    functools.partial is all well and good in code, but apparently it
    doesn't play real well with docs.

    Change-Id: Ia460473af9038d890346502784e3cf4d0e1d1c40

commit ac01d186b44856385a13fa77ecf527238c803443
Author: Pete Zaitcev <email address hidden>
Date: Mon Feb 11 21:42:34 2019 -0600

    Leave less garbage in /var/tmp

    All our tests that invoked broker.set_sharding_state() created
    /var/tmp/tmp, when it called DatabaseBroker.get_device_path(),
    then added "tmp" to it. We provided 1 less level, so it walked up
    outside of the test's temporary directory.

    The case of "cleanUp" instead of "tearDown" didn't break out of
    jail, but left trash in /var/tmp all the same.

    Change-Id: I8030ea49e2a977ebb7048e1d5dcf17338c1616df

commit bb1a2d45685a3b2230f21f7f6ff0e998e666723e
Author: Tim Burke <email address hidden>
Date: Fri Jul 27 20:03:36 2018 +0000

    Display crypto data/metadata details in swift-object-info

    Change-Id: If577c69670a10decdbbf5331b1a38d9392d12711

commit ea8e545a27f06868323ff91c1584d18ab9ac6cda
Author: Clay Gerrard <email address hidden>
Date: Mon Feb 4 15:46:40 2019 -0600

    Rebuild frags for unmounted disks

    Change the behavior of the EC reconstructor to perform a fragment
    rebuild to a handoff node when a primary peer responds with 507 to the
    REPLICATE request.

    Each primary node in an EC ring will sync with exactly three primary
    peers: in addition to the left & right nodes we now select a third node
    from the far side of the ring. If any of these partners respond
    unmounted, the reconstructor will rebuild its fragments to a handoff
    node with the appropriate index.

    To prevent ssync (which is uninterruptible) receiving a 409 (Conflict)
    we must give the remote handoff node the correct backend_index for the
    fragments it will receive. In the common case we will use
    deterministically different handoffs for each fragment index to prevent
    multiple unmounted primary disks from forcing a single handoff node to
    hold more than one rebuilt fragment.

    Handoff nodes will continue to attempt to revert rebuilt handoff
    fragments to the appropriate primary until it is remounted or
    rebalanced. After a rebalance of EC rings (potentially removing
    unmounted/failed devices), it's most IO efficient to run in
    handoffs_only mode to avoid unnecessary rebuilds.

    Closes-Bug: #1510342

    Change-Id: Ief44ed39d97f65e4270bf73051da9a2dd0ddbaec

commit 8a6159f67b6a3e7917e68310e4c24aae81...


tags: added: in-feature-losf
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/swift 2.21.0

This issue was fixed in the openstack/swift 2.21.0 release.
