Reconstructor does not restore a fragment to a handoff node

Bug #1510342 reported by Naoto Nishizono
This bug affects 2 people
Affects: OpenStack Object Storage (swift)
Status: Fix Released
Importance: Medium
Assigned to: Unassigned
Milestone: —

Bug Description

In the following case, unlike the replicator, the reconstructor does not restore a fragment to a handoff node in 2.5.0.
Is this a bug, intended behavior, or something still under development?

1. PUT an object
2. Unmount the device of a primary node
3. Run the reconstructor
4. The fragment that was on the device unmounted in step 2 is not restored to a handoff node

description: updated
description: updated
Revision history for this message
clayg (clay-gerrard) wrote :

yeah, I think rebuilding (on 507 *only*) would probably be more consistent with the handling of replicated objects.
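
A minimal sketch (plain Python, not Swift's reconstructor code) of the check being suggested: treat only an explicit 507 Insufficient Storage from a primary's REPLICATE response as "disk unmounted" and rebuild that fragment to a handoff, while other failures stay transient. The helper name is made up for illustration.

    from http import HTTPStatus

    def should_rebuild_to_handoff(replicate_status):
        # Timeouts, 5xx server errors, etc. are treated as transient; only an
        # explicit 507 from the REPLICATE request marks the disk as unmounted.
        return replicate_status == HTTPStatus.INSUFFICIENT_STORAGE  # 507

    if __name__ == '__main__':
        for status in (200, 404, 500, 507):
            print(status, should_rebuild_to_handoff(status))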

Changed in swift:
importance: Undecided → High
clayg (clay-gerrard)
tags: added: ec
Bill Huber (wbhuber)
Changed in swift:
status: New → Confirmed
Bill Huber (wbhuber)
Changed in swift:
assignee: nobody → Bill Huber (wbhuber)
Revision history for this message
funny_falcon (funny-falcon) wrote :

Was it fixed in a new release?

Our company needs erasure-coded storage, and Swift looks to be less hardware-hungry than Ceph
(at least judging by the architecture overview; I have only tested Ceph so far).
But such a serious issue is a show-stopper for me.

Revision history for this message
clayg (clay-gerrard) wrote :

The issue is still open. The current mitigation expects the operator to monitor for unmounted disks and make a ring change. I'll add this bug to the design session notes for Austin and we'll work it into prioritization with the other EC work.

Revision history for this message
funny_falcon (funny-falcon) wrote :

So is this the intended mode of operation? And is it not tied to the erasure-coded storage policy?

Revision history for this message
funny_falcon (funny-falcon) wrote :

It looks like Ceph does almost the same thing:
- it doesn't start rebalancing until a disk is marked as "out".
A disk can be marked as "out" either automatically or by the operator.

So maybe this issue's subject is not really an issue, and another issue
should be opened instead:
- "add a configurable automatic ring change on disk failure"?

If this issue is not tied to erasure coding at all, then it is unfortunate
to see it listed here, because it gives the impression that EC is unstable.

Revision history for this message
clayg (clay-gerrard) wrote :

I've changed the priority of this issue - EC works differently than replication in this regard, but mirroring the replicated behavior would require some design work, and it's not a high priority compared to other things that should be improved in the EC reconstructor.

For now EC does not reconstruct fragments when a primary node responds as unmounted - a fail-in-place strategy requires a ring change (unlike replicated, which just requires an unmount).

Maybe this bug should be reworded as a documentation or runbook issue.

Changed in swift:
importance: High → Medium
importance: Medium → Low
Revision history for this message
Tim Burke (1-tim-z) wrote :

To provide a little more background: there's a reason "EC works differently than Replication in this regard" and we *don't* want to just blindly mirror replication. In order to over-replicate, we just have to talk to one disk: the local disk. If we wanted to over-reconstruct, we'd be impacting ec_num_data disks, which could have ripple effects in the cluster.

That said, it's definitely *not good* that failing to deal with unmounted disks can (and eventually *will*) lead to data loss. It'd be *way better* if Swift could get us back up to full durability without a ring change.

So, idea: Have the reconstructor add an xattr like user.swift.sync_times -- have it be a dict of frag index to last-sync timestamp. If we get a 507 *and* our recorded last-sync for that index is more than a day (or some configurable amount) in the past, reconstruct to a hand-off. Maybe have another config opt for whether or not to reconstruct when there's no last-sync, so we don't trigger a reconstruction storm on upgrade?
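
A rough sketch of this proposal (not code from Swift): record a per-fragment-index "last synced" timestamp in an xattr on the object's .data file, and only allow a rebuild to a handoff once that record is older than a configurable threshold. The xattr name comes from the comment above; the helper names, the one-day default, and the rebuild_on_missing knob are illustrative assumptions.

    import json
    import os
    import time

    XATTR_NAME = 'user.swift.sync_times'   # name suggested above
    REBUILD_AFTER = 24 * 60 * 60           # hypothetical default: one day

    def load_sync_times(path):
        try:
            return json.loads(os.getxattr(path, XATTR_NAME))
        except (OSError, ValueError):
            return {}

    def record_sync(path, frag_index):
        # Called whenever a sync for this fragment index succeeds.
        times = load_sync_times(path)
        times[str(frag_index)] = time.time()
        os.setxattr(path, XATTR_NAME, json.dumps(times).encode('ascii'))

    def overdue_for_rebuild(path, frag_index, rebuild_on_missing=False):
        # True if the 507'ing fragment hasn't synced within REBUILD_AFTER;
        # rebuild_on_missing guards against a reconstruction storm on upgrade.
        last = load_sync_times(path).get(str(frag_index))
        if last is None:
            return rebuild_on_missing
        return time.time() - last > REBUILD_AFTER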

Revision history for this message
John Dickinson (notmyname) wrote :

Tim, I like this idea. I'm not a fan of the "just leave it and eventually lose data" current state of things, especially when replication has very different semantics for operators. Your proposed "do it automatically, but only after a time" sounds like a good compromise. Probably not a bad idea to log a warning (?) on startup if the config trigger turns it off.
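
A tiny illustration of that startup warning (the option name and helper are hypothetical, not an actual Swift config setting):

    import logging

    def warn_if_auto_rebuild_disabled(conf, logger=None):
        # Hypothetical 'auto_rebuild_on_unmounted' option: if an operator turns
        # the automatic behavior off, say so loudly at startup.
        logger = logger or logging.getLogger('object-reconstructor')
        if not conf.get('auto_rebuild_on_unmounted', True):
            logger.warning('Rebuild-to-handoff for unmounted primaries is '
                           'disabled; unmounted disks will remain under-durable '
                           'until a ring change is made.')

    if __name__ == '__main__':
        logging.basicConfig(level=logging.WARNING)
        warn_if_auto_rebuild_disabled({'auto_rebuild_on_unmounted': False})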

Changed in swift:
assignee: Bill Huber (wbhuber) → nobody
Revision history for this message
clayg (clay-gerrard) wrote :

I don't think there's any reason to add a delay; if we want EC to support a fail-in-place strategy we should just rebuild when we get the 507.

The question is really just *where* to rebuild (imagine multiple unmounted disks; we don't want two frags rebuilt to the first handoff).

Eventually someone will replace the drive, and if there's a ring change they'll probably flop over to rebalance mode and have a good chance of the rebuilt fragment from the handoff making its way over to the new primary instead of being rebuilt again.
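
An illustrative sketch (not the merged implementation) of the "*where* to rebuild" question: spread rebuilt fragments across the handoff list deterministically by fragment index, so two unmounted primaries don't both land on the first handoff node.

    def pick_handoff(handoff_nodes, frag_index):
        # Offsetting into the handoff list by fragment index keeps the choice
        # stable across reconstructor passes and distinct per failed primary.
        if not handoff_nodes:
            raise ValueError('no handoff nodes available')
        return handoff_nodes[frag_index % len(handoff_nodes)]

    if __name__ == '__main__':
        handoffs = ['h0', 'h1', 'h2', 'h3']
        for frag_index in (2, 5):   # two different unmounted primaries
            print(frag_index, pick_handoff(handoffs, frag_index))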

Changed in swift:
importance: Low → Medium
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to swift (master)

Reviewed: https://review.openstack.org/629056
Committed: https://git.openstack.org/cgit/openstack/swift/commit/?id=ea8e545a27f06868323ff91c1584d18ab9ac6cda
Submitter: Zuul
Branch: master

commit ea8e545a27f06868323ff91c1584d18ab9ac6cda
Author: Clay Gerrard <email address hidden>
Date: Mon Feb 4 15:46:40 2019 -0600

    Rebuild frags for unmounted disks

    Change the behavior of the EC reconstructor to perform a fragment
    rebuild to a handoff node when a primary peer responds with 507 to the
    REPLICATE request.

    Each primary node in an EC ring will sync with exactly three primary
    peers: in addition to the left & right nodes we now select a third node
    from the far side of the ring. If any of these partners respond
    unmounted, the reconstructor will rebuild its fragments to a handoff
    node with the appropriate index.

    To prevent ssync (which is uninterruptible) receiving a 409 (Conflict)
    we must give the remote handoff node the correct backend_index for the
    fragments it will receive. In the common case we will use
    deterministically different handoffs for each fragment index to prevent
    multiple unmounted primary disks from forcing a single handoff node to
    hold more than one rebuilt fragment.

    Handoff nodes will continue to attempt to revert rebuilt handoff
    fragments to the appropriate primary until it is remounted or
    rebalanced. After a rebalance of EC rings (potentially removing
    unmounted/failed devices), it's most IO efficient to run in
    handoffs_only mode to avoid unnecessary rebuilds.

    Closes-Bug: #1510342

    Change-Id: Ief44ed39d97f65e4270bf73051da9a2dd0ddbaec
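
A simplified sketch of the partner-selection rule described above: each primary syncs with its left and right neighbors plus one node from roughly the far side of the ring. This only mirrors the commit message's description; the actual logic lives in Swift's object reconstructor (swift/obj/reconstructor.py).

    def get_sync_partners(primary_nodes, my_index):
        # Return the three primary peers this node syncs with for a partition.
        n = len(primary_nodes)
        left = primary_nodes[(my_index - 1) % n]
        right = primary_nodes[(my_index + 1) % n]
        far = primary_nodes[(my_index + n // 2) % n]   # opposite side of the ring
        return [left, right, far]

    if __name__ == '__main__':
        nodes = list(range(8))                          # stand-in for an 8-node EC set
        print(get_sync_partners(nodes, my_index=0))     # -> [7, 1, 4]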

Changed in swift:
status: Confirmed → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to swift (feature/losf)

Fix proposed to branch: feature/losf
Review: https://review.openstack.org/637142

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to swift (feature/losf)

Reviewed: https://review.openstack.org/637142
Committed: https://git.openstack.org/cgit/openstack/swift/commit/?id=b68bef5cd80e2a1b71a1ef544e122b39dfa7ac57
Submitter: Zuul
Branch: feature/losf

commit 926a024135d380999d9f8494b19b59bb87a7f5b6
Author: Tim Burke <email address hidden>
Date: Thu Feb 14 21:02:01 2019 +0000

    Fix up flakey TestContainer.test_PUT_bad_metadata

    Change-Id: I7489f2bb95c27d1ddd5e8fa7e5786904100fb567

commit 002d21991e100ee6199e79679ae990c96ea05730
Author: Tim Burke <email address hidden>
Date: Wed Feb 13 17:02:08 2019 +0000

    Make get_data/async/tmp_dir explicit

    functools.partial is all well and good in code, but apparently it
    doesn't play real well with docs.

    Change-Id: Ia460473af9038d890346502784e3cf4d0e1d1c40

commit ac01d186b44856385a13fa77ecf527238c803443
Author: Pete Zaitcev <email address hidden>
Date: Mon Feb 11 21:42:34 2019 -0600

    Leave less garbage in /var/tmp

    All our tests that invoked broker.set_sharding_state() created
    /var/tmp/tmp, when it called DatabaseBroker.get_device_path(),
    then added "tmp" to it. We provided 1 less level, so it walked up
    outside of the test's temporary directory.

    The case of "cleanUp" instead of "tearDown" didn't break out of
    jail, but left trash in /var/tmp all the same.

    Change-Id: I8030ea49e2a977ebb7048e1d5dcf17338c1616df

commit bb1a2d45685a3b2230f21f7f6ff0e998e666723e
Author: Tim Burke <email address hidden>
Date: Fri Jul 27 20:03:36 2018 +0000

    Display crypto data/metadata details in swift-object-info

    Change-Id: If577c69670a10decdbbf5331b1a38d9392d12711

commit ea8e545a27f06868323ff91c1584d18ab9ac6cda
Author: Clay Gerrard <email address hidden>
Date: Mon Feb 4 15:46:40 2019 -0600

    Rebuild frags for unmounted disks

    Change the behavior of the EC reconstructor to perform a fragment
    rebuild to a handoff node when a primary peer responds with 507 to the
    REPLICATE request.

    Each primary node in an EC ring will sync with exactly three primary
    peers: in addition to the left & right nodes we now select a third node
    from the far side of the ring. If any of these partners respond
    unmounted, the reconstructor will rebuild its fragments to a handoff
    node with the appropriate index.

    To prevent ssync (which is uninterruptible) receiving a 409 (Conflict)
    we must give the remote handoff node the correct backend_index for the
    fragments it will receive. In the common case we will use
    deterministically different handoffs for each fragment index to prevent
    multiple unmounted primary disks from forcing a single handoff node to
    hold more than one rebuilt fragment.

    Handoff nodes will continue to attempt to revert rebuilt handoff
    fragments to the appropriate primary until it is remounted or
    rebalanced. After a rebalance of EC rings (potentially removing
    unmounted/failed devices), it's most IO efficient to run in
    handoffs_only mode to avoid unnecessary rebuilds.

    Closes-Bug: #1510342

    Change-Id: Ief44ed39d97f65e4270bf73051da9a2dd0ddbaec

commit 8a6159f67b6a3e7917e68310e4c24aae81...


tags: added: in-feature-losf
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/swift 2.21.0

This issue was fixed in the openstack/swift 2.21.0 release.
