EC: Durable frag_set is reclaimable in a race condition

Bug #1554378 reported by Kota Tsuyuzaki
This bug affects 2 people
Affects: OpenStack Object Storage (swift)
Status: Confirmed
Importance: Undecided
Assigned to: Unassigned
Milestone: none

Bug Description

A durable fragment set (one that has both .data and .durable files) seems to be reclaimable (deleted, in the worst case) under a race condition.

For example, this can happen when the PUTs carry timestamps older than reclaim_age (i.e. request X-Timestamp << (now - reclaim_age)), e.g. a reconstruction PUT from the reconstructor, or a very slow container-sync. Here I describe the reconstructor case:

Imagine we have two different timestamps for the same object path (an overwrite), but unfortunately Swift keeps both fragments due to a split-brain situation, and then the split is healed.

The timestamps satisfy t0 < t1 << (now - reclaim_age).

Now both reconstructors attempt to fix the object:

1. A reconstructor (with t0) sends a PUT and the object server creates a t0.data file.
2. Another reconstructor (with t1) sends a PUT and the object server creates a t1.data file.

Roughly:
[a reconstructor] ---(PUT t0)--> [an object-server] <--(PUT t1)--- [another reconstructor]

These are obviously different requests, but both touch the same object path on the device, so at this point both t0.data and t1.data exist on the object server. After this:

3. The object server creates a t0.durable and calls cleanup_list_dir. This removes t1.data, because it has no .durable and is older than reclaim_age.
4. The object server then creates a t1.durable and calls cleanup_list_dir again, which removes every .data and .durable file older than the newest t1.durable.

5. Finally, all .data files are gone, unexpectedly (a minimal sketch of this sequence follows below).
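To make the sequence above concrete, here is a small, self-contained model of the cleanup policy as I have described it. The cleanup() helper, RECLAIM_AGE constant, and the rules it encodes are simplified illustrations of the behaviour attributed to cleanup_list_dir, not Swift's actual diskfile code, and the filenames omit fragment indexes.

# Simplified model of the cleanup described in steps 3-5; not Swift code.
RECLAIM_AGE = 604800  # Swift's default reclaim_age (one week), in seconds

def cleanup(listing, now):
    """Return the filenames that survive one cleanup pass.

    Simplified rules from the report:
      * a .data with no matching .durable, older than reclaim_age, is removed
      * once any .durable exists, everything older than the newest .durable is removed
    """
    def ts(name):
        return float(name.rsplit('.', 1)[0])

    durables = [ts(f) for f in listing if f.endswith('.durable')]
    newest_durable = max(durables) if durables else None
    keep = []
    for f in listing:
        t = ts(f)
        if f.endswith('.data') and t not in durables and now - t > RECLAIM_AGE:
            continue  # step 3: t1.data is reclaimed (non-durable and too old)
        if newest_durable is not None and t < newest_durable:
            continue  # step 4: everything older than the newest .durable goes
        keep.append(f)
    return keep

now = 10 * RECLAIM_AGE
t0, t1 = now - 3 * RECLAIM_AGE, now - 2 * RECLAIM_AGE   # t0 < t1 << now - reclaim_age

files = ['%s.data' % t0, '%s.data' % t1]                # after steps 1 and 2
files = cleanup(files + ['%s.durable' % t0], now)       # step 3: t0.durable written
print(files)   # only t0.data and t0.durable remain; t1.data was reclaimed
files = cleanup(files + ['%s.durable' % t1], now)       # step 4: t1.durable written
print(files)   # only t1.durable remains: no .data left at all (step 5)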

So the race between steps 3 and 4 seems to be possible (even if rarely), unless I am missing something.

I am still digging into the sequence, but I would welcome help confirming or correcting this analysis.

Tags: ec
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to swift (master)

Fix proposed to branch: master
Review: https://review.openstack.org/289756

Changed in swift:
status: New → In Progress
Revision history for this message
Alistair Coles (alistair-coles) wrote :

This seems theoretically possible, but I have not confirmed it in practice.

Changed in swift:
status: In Progress → Confirmed
Changed in swift:
status: Confirmed → In Progress
Revision history for this message
Alistair Coles (alistair-coles) wrote :

This may well be fixed by [1] but it is hard to prove with tests.

Brainstorming:

- in a future world with [1] merged and .durables eliminated, when the reconstructor is PUTting a fragment to another node we should skip the whole two-phase rename completely and just write a durable data file straight away. We *just* need a way to tell the object server and DiskFile to do that. *just* == maybe not so simple. (A hypothetical sketch of such a shortcut follows below.)

- that eliminates any window in which a non-durable frag that the reconstructor just wrote is vulnerable to being reclaimed. When reconstructing, there is either no file, or there is a durable data file.

[1] https://review.openstack.org/355958
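As a purely hypothetical illustration of that idea: the X-Backend-Durable-Put header, the durable keyword argument, and the put_fragment() helper below do not exist in Swift; only the writer put()/commit() pattern is real. The "shortcut to durable" could look roughly like this from the object server's point of view:

# Hypothetical sketch only: the header name and the durable kwarg are made up.
def put_fragment(writer, metadata, timestamp, headers):
    # writer: a DiskFile writer-like object exposing put() and commit()
    if headers.get('X-Backend-Durable-Put') == 'true':   # trusted internal writer
        # ssync/reconstructor case: land the frag as durable in a single step,
        # so there is never a reclaimable non-durable window
        writer.put(metadata, durable=True)
    else:
        # normal client PUT keeps the existing two-phase behaviour
        writer.put(metadata)
        writer.commit(timestamp)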

Revision history for this message
Alistair Coles (alistair-coles) wrote :
Revision history for this message
clayg (clay-gerrard) wrote :

The other patch working on this is https://review.openstack.org/#/c/389627/

Revision history for this message
clayg (clay-gerrard) wrote :

I have a swift 2.11+ (almost 2.12) that's still hitting a traceback very similar to the one in the duplicate bug:

Jan 2 05:16:44 storage-node object-server: Problem making data file durable /srv/node/d286/objects/63534/0c5/f82e383194229d3a5c444b994be410c5/1483359384.22840#15#d.data ([Errno 2] No such file or directory):
Traceback (most recent call last):
  File "/opt/swift/lib/python2.7/site-packages/swift/obj/diskfile.py", line 2679, in _finalize_durable
    os.rename(data_file_path, durable_data_file_path)
OSError: [Errno 2] No such file or directory
Jan 2 05:16:44 storage-node object-server: ERROR __call__ error with PUT /d286/63534/AUTH_USER/CONTAINER/OBJNAME :
Traceback (most recent call last):
  File "/opt/swift/lib/python2.7/site-packages/swift/obj/server.py", line 1102, in __call__
    res = getattr(self, req.method)(req)
  File "/opt/swift/lib/python2.7/site-packages/swift/common/utils.py", line 1609, in _timing_stats
    resp = func(ctrl, *args, **kwargs)
  File "/opt/swift/lib/python2.7/site-packages/swift/obj/server.py", line 824, in PUT
    writer.commit(request.timestamp)
  File "/opt/swift/lib/python2.7/site-packages/swift/obj/diskfile.py", line 2726, in commit
    self._finalize_durable, data_file_path, durable_data_file_path)
  File "/opt/swift/lib/python2.7/site-packages/swift/common/utils.py", line 3517, in tpool_reraise
    raise resp
DiskFileError: Problem making data file durable /srv/node/d286/objects/63534/0c5/f82e383194229d3a5c444b994be410c5/1483359384.22840#15#d.data ([Errno 2] No such file or directory) (txn: tx5f3fda36d1664042b78e9-00586a4498)

So I don't think eliminating durables fixed it.

Revision history for this message
clayg (clay-gerrard) wrote :

IME this is happening when a handoff part is being reverted *to* a handoff location while that location is *also* being reverted to the primary (or worse another handoff).

I think this would be a lot less common if we fixed lp bug #1653169

Revision history for this message
Alistair Coles (alistair-coles) wrote :

IMHO we need to eliminate the "two-phase commit" that creates the window when a data file is not yet durable but is reclaimable (see comment #3). When an object is PUT and is already known to be durable, e.g. by ssync or the reconstructor, just create the .data with its durable filename in the first phase.

Revision history for this message
Romain LE DISEZ (rledisez) wrote :

Like clayg, I still see these issues in 2.12. It is not related to reconstruction or handoff. I put some debug logging in diskfile; here is what I found.

100% of these errors happen on os.rename(data_file_path, durable_data_file_path)

In every case I analysed (~10):
- it was a request coming from the proxy-server, not the object-reconstructor;
- the datadir was still there, but not the data file to rename to durable;
- there were concurrent uploads for the same object, and the most recent upload completed before the first one.

Let me explain the last point:
1. A user starts to upload the object, ts=1
2. A user starts to upload the object, ts=2
3. The upload with ts=2 finishes and is marked durable
4. The upload with ts=1 finishes and fails to be marked durable

Based on this observation, I tried to reproduce it, without success.

Here are the logs from one case (Swift + custom logs):

Mar 16 11:15:10 localhost object-server: OVH ERROR: data_file_path (/srv/node/disk-01-000/objects-1/69029/2c6/43697ca96ece87902bfefd8727af72c6/1489659239.46979#13.data) does not exist (rename: /srv/node/disk-01-000/objects-1/69029/2c6/43697ca96ece87902bfefd8727af72c6/1489659239.46979#13.data -> /srv/node/disk-01-000/objects-1/69029/2c6/43697ca96ece87902bfefd8727af72c6/1489659239.46979#13#d.data):
Traceback (most recent call last):
  File "/opt/swift-2.12.0-ovh20170307110844/local/lib/python2.7/site-packages/swift/obj/diskfile.py", line 2711, in _finalize_durable
    os.rename(data_file_path, durable_data_file_path)
OSError: [Errno 2] No such file or directory

Mar 16 11:15:10 localhost object-server: Problem making data file durable /srv/node/disk-01-000/objects-1/69029/2c6/43697ca96ece87902bfefd8727af72c6/1489659239.46979#13#d.data ([Errno 2] No such file or directory):
Traceback (most recent call last):
  File "/opt/swift-2.12.0-ovh20170307110844/local/lib/python2.7/site-packages/swift/obj/diskfile.py", line 2711, in _finalize_durable
    os.rename(data_file_path, durable_data_file_path)
OSError: [Errno 2] No such file or directory

Mar 16 11:15:10 localhost object-server: ERROR __call__ error with PUT /disk-01-000/69029/AUTH_abc/container/obj :
Traceback (most recent call last):
  File "/opt/swift-2.12.0-ovh20170307110844/local/lib/python2.7/site-packages/swift/obj/server.py", line 1102, in __call__
    res = getattr(self, req.method)(req)
  File "/opt/swift-2.12.0-ovh20170307110844/local/lib/python2.7/site-packages/swift/common/utils.py", line 1609, in _timing_stats
    resp = func(ctrl, *args, **kwargs)
  File "/opt/swift-2.12.0-ovh20170307110844/local/lib/python2.7/site-packages/swift/obj/server.py", line 824, in PUT
    writer.commit(request.timestamp)
  File "/opt/swift-2.12.0-ovh20170307110844/local/lib/python2.7/site-packages/swift/obj/diskfile.py", line 2770, in commit
    self._finalize_durable, data_file_path, durable_data_file_path)
  File "/opt/swift-2.12.0-ovh20170307110844/local/lib/python2.7/site-packages/swift/common/utils.py", line 3517, in tpool_reraise
    raise resp
DiskFileError: Problem making data file durable /srv/node/disk-01-000/objects-1/69029/2c6/43697ca96ece87902bfefd8727af72c6/1489659239.46979#13#d.data ([Er...


Revision history for this message
Alistair Coles (alistair-coles) wrote :

@Romain, I'm not sure that what you describe is the same issue - you have two clients "concurrently" making a PUT to the same URL; only one of them wins and gets a 201, and that client's fragment is durable and on disk. The bug reported here results in all .data being lost.

What you are seeing could be:

client proxy obj server
------ ----- ----------
PUT at t1 ---> 1st phase backend PUT for t1 ----> t1#<frag>.data
PUT at t2 ---> 1st phase backend PUT for t2 ----> t2#<frag>.data
                                            <-t2 ok--
                                            <-t1 ok--
                  2nd phase backend PUT for t2 ----> t2#<frag>#d.data
                                                         --> t1#<frag>.data is cleaned up
                  2nd phase backend PUT for t1 ----> rename of t1#<frag>.data fails

In your scenario, it's not great that the second (losing) client gets a 500. But (more worrying) I am curious if you see different outcomes from the race on different nodes i.e. can you end up with some frags marked durable at t1 and some marked durable at t2 for the same object on different nodes? I don't think that should happen - the t2 frag should always win and become durable consistently across the object servers.
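To make the last line of that diagram concrete, here is a minimal standalone demonstration (not Swift code; the filenames only mimic the shape seen in the logs above) of why the losing second-phase commit surfaces as OSError [Errno 2]: the cleanup done on behalf of the newer durable frag has already removed the older #<frag>.data, so the rename to its durable name has nothing to operate on.

# Minimal demonstration, not Swift code.
import os, tempfile

datadir = tempfile.mkdtemp()
t1_data = os.path.join(datadir, '1489659239.46979#13.data')
t1_durable = os.path.join(datadir, '1489659239.46979#13#d.data')

open(t1_data, 'w').close()   # first-phase PUT for t1 lands the non-durable frag
os.unlink(t1_data)           # cleanup triggered by the newer t2 durable removes it

try:
    os.rename(t1_data, t1_durable)   # second-phase commit for t1
except OSError as err:
    print('second-phase commit fails: %s' % err)   # [Errno 2] No such file or directory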

Revision history for this message
Alistair Coles (alistair-coles) wrote :

FWIW I do not expect the elimination of .durables to have fixed this bug.

With reference to comment #3, I think the solution is to eliminate the two phase commit in the object server PUT path *when the object is known a priori to be durable* e.g. during reconstruction.

Unfortunately that is a non-trivial change to ssync and the object server - in principle there are no major obstacles, but I expect significant code churn to provide that "shortcut to durable" path.

Revision history for this message
Kota Tsuyuzaki (tsuyuzaki-kota) wrote :

This report is based on the older version of Swift EC that used a separate .durable file. Currently we rename the data file to <ts>#<frag>#d.data instead of creating a .durable file, so this report needs updating if the race condition can still occur in the newest Swift EC code.
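For reference, the two on-disk layouts being compared look like this (timestamps reused from the logs above, purely illustrative):

Old EC layout (durability is a separate marker file):
  1489659239.46979#13.data
  1489659239.46979.durable

Current EC layout (durability is encoded in the data filename):
  1489659239.46979#13#d.data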

Hence I am setting the status to Incomplete.

Changed in swift:
assignee: Kota Tsuyuzaki (tsuyuzaki-kota) → nobody
status: In Progress → Incomplete
Revision history for this message
Alistair Coles (alistair-coles) wrote :

I have not reproduced this race, but I am still of the opinion that it can occur.

Revision history for this message
Alistair Coles (alistair-coles) wrote :

I am marking this Confirmed because I do not want it to expire while marked Incomplete. In my opinion, the change to marking a data file durable by renaming rather than writing a separate file did not solve the race.

Changed in swift:
status: Incomplete → Confirmed
Revision history for this message
clayg (clay-gerrard) wrote :

I don't think SSYNC is (yet?) atomically creating durable frag during replication - I don't see any obvious way to amend the DiskFile API to even support that.

But I do think not reverting to handoffs has removed the main opportunity for this race.

Revision history for this message
Alistair Coles (alistair-coles) wrote :
