Failed to ssync object repeatedly

Bug #2007643 reported by Chieh Chih Chang
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
OpenStack Object Storage (swift)
Fix Released
Critical
Alistair Coles

Bug Description

I'm solving a misplaced partition, but I get 500 error code multiple times.

Misplaced on node : d655
$ sudo /opt/ss/bin/python3 /opt/ss/support/bin/classify_handoff_parts.py all
   device: primary handoff( suffix) misplaced( suffix)
   .....
   d655: 0 0( 0) 1( 4089)

Get 500 error from remote node
$ /opt/ss/bin/python3.7 /opt/ss/bin/swift-object-reconstructor /etc/swift/object-server/a.conf -d d655 -v -o
object-reconstructor: Starting 176727
object-reconstructor: [worker 1/1 pid=176738] Running object reconstructor in script mode.
object-reconstructor: [worker 1/1 pid=176738] X.X.X.X:6003/d38782/229849 Unexpected response: ":ERROR: 500 'ERROR: With :UPDATES: 1 failures to 0 successes'"
object-reconstructor: [worker 1/1 pid=176738] 1/3 (33.33%) partitions reconstructed in 10.93s (0.09/sec, 21s remaining)
object-reconstructor: [worker 1/1 pid=176738] Object reconstruction complete (once). (0.18 minutes)
object-reconstructor: Finished 176727
object-reconstructor: Exited 176727

Error logs on remote node
2023-02-16T03:49:55.322872+00:00 object-server-r: warning ssync subrequest failed with 409: POST /d38782/229849/XXX/XXXX/XXX (<html><h1>Conflict</h1><p>There was a conflict when trying to complete your request.</p></html>)
2023-02-16T03:55:21.584880+00:00 object-server-r: warning ssync subrequest failed with 409: POST /d38782/229849/XXX/XXXX/XXX (<html><h1>Conflict</h1><p>There was a conflict when trying to complete your request.</p></html>)
....

Object info on remote node
$ ls
XXX.data XXX.meta
$ swift-object-info XXX
Path: XXX
  Account: XXX
  Container: XXX
  Object: XXX
  Object hash: XXX
Content-Type: binary/octet-stream
Timestamp: 2023-02-10T08:24:22.774370 (1676017462.77437_0000000000000001)
System Metadata:
  X-Object-Sysmeta-Allow-Reserved-Names: true
  X-Object-Sysmeta-Container-Update-Override-Etag: XXX
  X-Object-Sysmeta-Ec-Content-Length: 0
  X-Object-Sysmeta-Ec-Etag: XXX
  X-Object-Sysmeta-Ec-Frag-Index: 2
  X-Object-Sysmeta-Ec-Scheme: isa_l_rs_cauchy 8+4
  X-Object-Sysmeta-Ec-Segment-Size: 1048576
  X-Object-Sysmeta-Symlink-Target: XXX
  X-Object-Sysmeta-Symlink-Target-Bytes: 7
  X-Object-Sysmeta-Symlink-Target-Etag: XXX
  X-Object-Sysmeta-Symloop-Extend: true
  X-Object-Sysmeta-Versions-Symlink: true
Transient System Metadata:
  No metadata found
User Metadata:
  No metadata found
Other Metadata:
  No metadata found
ETag: XXX (valid)
Content-Length: 0 (valid)
Partition 229849
Hash XXX

Object info on source node
$ ls
XXX.data XXX.meta
$ swift-object-info XXX
Path: XXX
  Account: XXX
  Container: XXX
  Object: XXX
  Object hash: e0766dfd7443638268add951f36d3e8b
Content-Type: binary/octet-stream
Timestamp: 2023-02-10T08:24:22.774370 (1676017462.77437_0000000000000001)
System Metadata:
  X-Object-Sysmeta-Allow-Reserved-Names: true
  X-Object-Sysmeta-Container-Update-Override-Etag: XXX
  X-Object-Sysmeta-Ec-Content-Length: 0
  X-Object-Sysmeta-Ec-Etag: XXX
  X-Object-Sysmeta-Ec-Frag-Index: 2
  X-Object-Sysmeta-Ec-Scheme: isa_l_rs_cauchy 8+4
  X-Object-Sysmeta-Ec-Segment-Size: 1048576
  X-Object-Sysmeta-Symlink-Target: XXX
  X-Object-Sysmeta-Symlink-Target-Bytes: 7
  X-Object-Sysmeta-Symlink-Target-Etag: XXX
  X-Object-Sysmeta-Symloop-Extend: true
  X-Object-Sysmeta-Versions-Symlink: true
Transient System Metadata:
  No metadata found
User Metadata:
  No metadata found
Other Metadata:
  No metadata found
ETag: XXX (valid)
Content-Length: 0 (valid)
Partition 229849
Hash XXX

Changed in swift:
status: New → Confirmed
importance: Undecided → Critical
Revision history for this message
Alistair Coles (alistair-coles) wrote :

The reconstructor/ssync appears to be repeatedly trying to sync an object that is already in sync. That will cause a 409 conflict in the ssync subrequest stream.

We inspected the on disk files for this object on the sender and receiver side. Both had identical timestamps. However, the data files have an offset timestamps, and there is also a meta file.

i.e. both sender and receiver have:

t0_1#2#d.data
t1.meta

The sender encodes these timestamps in a compact form and the receiver decodes them and compares to its on disk file set.

The encoding represents t1 (the meta timestamp) as a delta from the data timestamp, NOT INCLUDING the data timestamp offset, i.e. delta = t1 - t0.

The decoding is erroneously calculating the meta timestamp as the sum of the delta plus the data timestamp INCLUDING the offset.

So the ssync receiver therefore erroneously thinks that the sender has a newer meta file:
t0_1#2#d.data
t1_1.meta

and so the receiver request that the sender POSTs its meta file content. On receiving the POST, the ssync receiver object server compares the correct timestamps and discovers they are identical to what it has on disk and so returns a 409 Conflict response.

Revision history for this message
Alistair Coles (alistair-coles) wrote (last edit ):

This bug will manifest predictably in the following circumstances:

1. An object's data file has been created with a timestamp offset. This happens when an object is versioned (the offset if 1) or migrated between policies by the reconciler (the offset is 2).

2. The object subsequently has POST request that creates a meta file (for example a POST to set X-Delete-At)

It would appear that in these circumstances, if that object is included in the scope of a reconstructor job, but the object is already in sync, the ssync protocol will always fail with a 409 and the reconstructor will therefore fail to process the partition.

Changed in swift:
status: Confirmed → In Progress
Revision history for this message
Alistair Coles (alistair-coles) wrote :
Changed in swift:
assignee: nobody → Alistair Coles (alistair-coles)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to swift (master)

Reviewed: https://review.opendev.org/c/openstack/swift/+/874122
Committed: https://opendev.org/openstack/swift/commit/2fe18b24cd7e709c8712fcda7369b0715a21fb89
Submitter: "Zuul (22348)"
Branch: master

commit 2fe18b24cd7e709c8712fcda7369b0715a21fb89
Author: Alistair Coles <email address hidden>
Date: Thu Feb 16 18:39:36 2023 +0000

    ssync: fix decoding of ts_meta when ts_data has offset

    The SsyncSender encodes object file timestamps in a compact form and
    the SsyncReceiver decodes the timestamps and compares them to its
    object file set.

    The encoding represents the meta file timestamp as a delta from the
    data file timestamp, NOT INCLUDING the data file timestamp offset.

    Previously, the decoding was erroneously calculating the meta file
    timestamp as the sum of the delta plus the data file timestamp
    INCLUDING the offset.

    For example, if the SssyncSender has object file timestamps:

      ts_data = t0_1.data
      ts_meta = t1.data

    then the receiver would erroneously perceive that the sender has:

      ts_data = t0_1.data
      ts_meta = t1_1.data

    As described in the referenced bug report, this erroneous decoding
    could cause the SsyncReceiver to request that the SsyncSender sync an
    object that is already in sync, which results in a 409 Conflict at the
    receiver. The 409 causes the ssync session to terminate, and the same
    process repeats on the next attempt.

    Closes-Bug: #2007643
    Co-Authored-By: Clay Gerrard <email address hidden>
    Change-Id: I74a0aac0ac29577026743f87f4b654d85e8fcc80

Changed in swift:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/swift 2.32.0

This issue was fixed in the openstack/swift 2.32.0 release.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.