XFS corruption can create zero-byte partition files instead of dirs

Bug #1045954 reported by Darrell Bishop
This bug affects 1 person
Affects: OpenStack Object Storage (swift)
Status: Fix Released
Importance: Undecided
Assigned to: Darrell Bishop
Milestone: 1.7.5

Bug Description

In the past, I had to run

 # xfs_repair /dev/sdc

Later, after dropping that device's weight in the object ring to 0, I noticed the object-replicator was poisoned by a zero-byte file where a directory should have been:

  # ll /srv/node/d9/objects/
  ...
  -rw-r--r-- 1 swift swift 0 2012-08-17 11:26 188978

This causes the object-replicator to fail when it tries to handle this "partition", with the traceback below. The failure is "benign" in the sense that all data (other than whatever was damaged when my XFS filesystem was corrupted) did get replicated off the drive. However, it still results in log spew and a continuously incrementing object-replicator.partition.delete.count.<device> StatsD metric, because update_deleted() keeps getting called for this "partition" and then failing.

  Sep 4 10:57:39 swift-test-01 object-replicator Error syncing handoff partition:
  Traceback (most recent call last):
    File "/usr/lib/pymodules/python2.7/swift/obj/replicator.py", line 373, in update_deleted
      suffixes = tpool.execute(tpool_get_suffixes, job['path'])
    File "/usr/lib/python2.7/dist-packages/eventlet/tpool.py", line 76, in tworker
      rv = meth(*args,**kwargs)
    File "/usr/lib/pymodules/python2.7/swift/obj/replicator.py", line 366, in tpool_get_suffixes
      return [suff for suff in os.listdir(path)
  OSError: [Errno 20] Not a directory: '/srv/node/d9/objects/188978'
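
The failure mode in that traceback can be reproduced outside Swift. The following is a minimal sketch, assuming Python 3 and a temporary directory standing in for /srv/node/d9/objects; list_suffixes() and demo() are hypothetical names used only for illustration, not Swift functions.

```python
import errno
import os
import tempfile


def list_suffixes(path):
    """Mimic the failing call: list the entries of a partition path."""
    return [suff for suff in os.listdir(path)]


def demo():
    """Create a zero-byte file named like a partition and try to list it.

    Returns the errno of the OSError raised by os.listdir(), or None if
    no error was raised.
    """
    with tempfile.TemporaryDirectory() as objects_dir:
        bogus = os.path.join(objects_dir, "188978")
        open(bogus, "w").close()  # zero-byte file where a directory is expected
        try:
            list_suffixes(bogus)
        except OSError as err:
            return err.errno
    return None


if __name__ == "__main__":
    print(demo() == errno.ENOTDIR)
```

Because nothing removes the file, every replication pass would hit the same error again, which matches the repeated log entries described above.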

This means there are some portions of Swift which are not robust to zero-byte files being where they normally shouldn't. That node's regular and zero-byte-file object-auditor processes are not reporting any errors (nor are they fixing this zero-byte-file-where-a-partition-directory-should-be problem, either).

I think collect_jobs() should verify that the partition paths it puts into jobs are directories and not zero-byte files. I think if collect_jobs notices a zero-byte file where a partition directory should be, it should log this (WARNING level?), remove the zero-byte file, and then move on, not creating a job for that partition path.
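
The check proposed above might look roughly like this. It is a sketch under stated assumptions, not the merged patch: reap_bogus_partitions() is a hypothetical standalone helper (the real collect_jobs() is a method of the object replicator and builds job dicts), the log message is invented, and it is written for Python 3. It removes only zero-byte regular files, as suggested in this description, and skips anything else that is not a directory.

```python
import logging
import os


def reap_bogus_partitions(obj_path, logger=None):
    """Return the partition directories under obj_path.

    Zero-byte regular files found where a partition directory should be
    are logged at WARNING level and removed; no path is returned for them.
    """
    logger = logger or logging.getLogger(__name__)
    partitions = []
    for name in sorted(os.listdir(obj_path)):
        part_path = os.path.join(obj_path, name)
        if os.path.isdir(part_path):
            partitions.append(part_path)
        elif os.path.isfile(part_path) and os.path.getsize(part_path) == 0:
            logger.warning(
                "Removing zero-byte file found in place of a partition "
                "directory: %s", part_path)
            os.remove(part_path)
        else:
            # Hypothetical policy: leave unexpected non-empty files alone,
            # but do not create a job for them.
            logger.warning("Skipping non-directory entry: %s", part_path)
    return partitions
```

A caller would then create replication jobs only for the returned paths, so os.listdir() is never attempted on a plain file.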

OpenStack Infra (hudson-openstack) wrote : Fix proposed to swift (master)

Fix proposed to branch: master
Review: https://review.openstack.org/12378

Changed in swift:
assignee: nobody → Darrell Bishop (darrellb)
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to swift (master)

Reviewed: https://review.openstack.org/12378
Committed: http://github.com/openstack/swift/commit/46a093f068a158b72479522792509a882d3e47f1
Submitter: Jenkins
Branch: master

commit 46a093f068a158b72479522792509a882d3e47f1
Author: Darrell Bishop <email address hidden>
Date: Tue Sep 4 13:59:26 2012 -0700

    Obj replicator cleans up files where part dirs should be.

    If a partition directory was a file instead of a directory, the
    object-replicator would attempt to listdir() it, raise an exception, and
    try again next iteration. This condition could arise after running
    xfs_repair.

    Now, collect_jobs() will reap any partition directories which are
    actually files. Fixes bug 1045954.

    Change-Id: Id65d3eab2effd61c3f6b25250611c88c907b2a16

Changed in swift:
status: In Progress → Fix Committed
Thierry Carrez (ttx)
Changed in swift:
milestone: none → 1.7.5
status: Fix Committed → Fix Released