Comment 1 for bug 1655608

clayg (clay-gerrard) wrote:

I knew about this. I remember chatting with acoles about it (in #openstack-swift on Freenode?) while working on https://review.openstack.org/#/c/385609/ and the other related bugs. However, I don't remember filing a bug for this, and can't find anything. So thanks!

This is sort of the equivalent of EC dark data - except instead of the whole object it's just a fragment, and instead of the object getting happily and silently repopulated on all nodes you just get annoying messages in your logs FOREVER.

The workaround is twofold:

1) don't reintroduce nodes after reclaim age - because that makes dark data

2) use a time machine and don't run EC policies with swift < 2.11 - there are since-fixed bugs in the reconstructor that could prevent *any* progress, which leads to out-of-date parts/suffixes (it was effectively like the reconstructors had been off for months on end, and then when you upgrade it's entirely possible that tombstones on handoff nodes get reaped instead of clearing out these orphaned frags)

The fix is not obvious to me :\

I'm also not sure on the priority.

IIRC the "Unable to get enough responses" message just causes the reconstructor to move on to the next hash in the suffix without disrupting the ssync protocol. If that triage is incorrect, it's probably HIGH or CRITICAL until we can find a workaround. We need to make sure the reconstructor can make *other* progress even if these frags are unprocessable.

As long as the reconstructor is otherwise making progress I think we can leave it at MEDIUM or LOW priority until we grow some more braincells... Once a cluster is fully upgraded and rebalanced it seems manageable: you could even just extract the offending object names from the logs and script an audit by hand (rough sketch below). Blasting out tombstones over these names with a superadmin/reseller token would totally clean them up.
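Something like this is what I have in mind for the hand audit - just a sketch, not tested against a real cluster. It assumes the reconstructor log line carries the full /account/container/object path after the "Unable to get enough responses" message, and that STORAGE_URL/TOKEN are the cluster root plus a reseller-scoped token; the regex and env var names are mine, so adjust to whatever your logs actually look like:

#!/usr/bin/env python
# Sketch only: pull object paths out of reconstructor log lines and lay
# down tombstones for orphaned names via the proxy.
#
# Assumptions (mine, not from swift itself): the log line carries a full
# /account/container/object path after the "Unable to get enough
# responses" message, and STORAGE_URL/TOKEN are the cluster root (e.g.
# http://proxy:8080/v1) plus a reseller/superadmin token.
import os
import re
import sys

import requests

STORAGE_URL = os.environ['STORAGE_URL']
TOKEN = os.environ['TOKEN']

# crude pattern - tune it against what your reconstructor actually logs
PATH_RE = re.compile(
    r'Unable to get enough responses.*?(/[^/\s]+/[^/\s]+/\S+)')


def offending_paths(log_lines):
    """Yield unique /account/container/object paths found in the log."""
    seen = set()
    for line in log_lines:
        match = PATH_RE.search(line)
        if match and match.group(1) not in seen:
            seen.add(match.group(1))
            yield match.group(1)


def main(logfile):
    headers = {'X-Auth-Token': TOKEN}
    with open(logfile) as fp:
        for path in offending_paths(fp):
            url = STORAGE_URL + path
            # the "audit" part: only touch names that 404 through the
            # proxy, i.e. the object is gone and only frags are left
            if requests.head(url, headers=headers).status_code == 404:
                resp = requests.delete(url, headers=headers)
                print('DELETE %s -> %s' % (path, resp.status_code))


if __name__ == '__main__':
    main(sys.argv[1])

The HEAD is the audit step - only names that 404 through the proxy get a DELETE, so you're only blasting tombstones where the object is already gone and the frags really are orphans.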