Bit rot inside container_info table

Bug #1838466 reported by Bjoern
This bug affects 1 person
Affects: OpenStack Object Storage (swift)
Status: New
Importance: Undecided
Assigned to: Unassigned
Milestone: (none listed)

Bug Description

Based on a production issue with a Liberty version of swift (I know it's EOL), I noticed that the container_stat view contained invalid data, as shown here:

sqlite> select * from container_stat;
AUTH_2943f48102304a43a5bcf7eefbd22b01|mailstore-rs-119-1-104|1478133529.15135|147:133539&16340|0|1478133529.16340|0|15034|2207569538|cb43cd4935e5d079e0c0f00806bd0c67|f729fbd8-b79f-44c9-869d-d1c3da0cf01e||1478133529.16340||-1|-1|319046|0|15034|2207569538

The put timestamp contained 147:133539&16340, which should have been 1478133529.16340.
I suspect the issue was caused by 2 bits flipped, but the machine is running ECC memory and did not experience filesystem (XFS) issues or crashes, so it remains a mystery where the corruption is happening.
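
To check the bit-flip hypothesis, the two strings can be compared byte by byte. The following is a minimal sketch, not part of swift; the helper name bit_diff is made up for illustration. It XORs each pair of ASCII bytes and counts the bits that differ.

```python
def bit_diff(expected: str, observed: str):
    """Return (offset, expected_char, observed_char, flipped_bits) for each differing byte."""
    if len(expected) != len(observed):
        raise ValueError("strings must have equal length")
    diffs = []
    pairs = zip(expected.encode("ascii"), observed.encode("ascii"))
    for offset, (want, got) in enumerate(pairs):
        xor = want ^ got  # set bits mark the positions where the bytes disagree
        if xor:
            diffs.append((offset, chr(want), chr(got), bin(xor).count("1")))
    return diffs


# Values taken from the report above.
diffs = bit_diff("1478133529.16340", "147:133539&16340")
for offset, want, got, bits in diffs:
    print(f"offset {offset}: {want!r} -> {got!r} ({bits} bit(s) flipped)")
```

For these two strings the sketch reports differences at offsets 3, 8 and 10 ('8' to ':', '2' to '3', '.' to '&'), each a single flipped bit.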

Is there a reason why put_timestamp is defined as text in the source of the view (the container_info table) rather than as floating point or datetime? With a stricter type we would hopefully recognize these issues during PUT requests.

The corruption generates a 500 when querying the container service for the impacted database. I manually fixed the sqlite database so that container listings succeed now.

Bjoern (bjoern-t)
description: updated
Bjoern (bjoern-t) wrote :

This is related to https://bugs.launchpad.net/swift/+bug/1823785, where I originally reported it.

Tim Burke (1-tim-z) wrote :

sqlite will happily store whatever you tell it to store, so declaring datatypes is more of a hint to developers than anything else:

  $ sqlite3 test.db
  SQLite version 3.26.0 2018-12-01 12:34:55
  Enter ".help" for usage hints.
  sqlite> create table t(some_int int, some_date datetime);
  sqlite> insert into t (some_int, some_date) values ('foo', 'bar');
  sqlite> select * from t;
  foo|bar
  sqlite> .schema t
  CREATE TABLE t(some_int int, some_date datetime);
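
The same point can be checked against the value from this report. The sketch below uses Python's sqlite3 module with a made-up table named demo (not swift's schema) whose column is declared REAL. It assumes an ordinary, non-STRICT table. The intact timestamp is converted to a real, while the corrupted one is simply stored as text, so the declared type would not have rejected it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table for illustration: put_timestamp declared REAL, not TEXT.
conn.execute("CREATE TABLE demo (put_timestamp REAL)")
conn.executemany(
    "INSERT INTO demo (put_timestamp) VALUES (?)",
    [("1478133529.16340",), ("147:133539&16340",)],
)
# typeof() reports the storage class sqlite actually used for each value.
rows = conn.execute(
    "SELECT put_timestamp, typeof(put_timestamp) FROM demo ORDER BY rowid"
).fetchall()
for value, storage_class in rows:
    print(repr(value), storage_class)
```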

Even so, floating point representations can be problematic when it comes to round-tripping:

  >>> 99999999999.99999
  99999999999.99998

(Though admittedly that's an order of magnitude outside of the range at which we designed swift to work.)
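
To make the range remark concrete, here is a small sketch that checks whether a timestamp string survives a trip through a Python float. The helper name round_trips is made up, and the five-decimal format is an assumption based on the values shown in this report.

```python
def round_trips(ts: str) -> bool:
    """True if ts is unchanged after str -> float -> five-decimal string."""
    return "%.5f" % float(ts) == ts


# A timestamp of the magnitude seen in this report.
in_range = round_trips("1478133529.16340")
# The larger value from the example above.
out_of_range = round_trips("99999999999.99999")
print(in_range, out_of_range)
```

Under that assumption, the timestamp from the report round-trips, while the larger value does not.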

I take it this means you've been seeing more instances of corruption since that initial report? If so, does it seem to be tied to the same disks? That is, is there any overlap in the set of disks that were responsible for the corrupted container back in April and those responsible for this one now? Hmm...

Bjoern (bjoern-t) wrote :

This is from an issue going back to May, which could be related to the April issue. At this point I don't know whether it is still actively happening, and we won't know until we do an audit, which is not planned right now due to the vast size of our installation.

Thanks for the sqlite test. That would mean we have to look for improvements inside the code so that corrupt databases are not replicated, etc., so I will close this issue here and we can look at LP1823785.
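
As a rough idea of what such a check could look like, the sketch below scans put_timestamp values and flags any that do not look like the intact timestamps in this report. This is not swift code: the function name, the pattern (ten digits, a dot, five digits), and the stand-in container_stat table are all assumptions for illustration.

```python
import re
import sqlite3

# Assumption: a well-formed timestamp looks like the intact values in this
# report, i.e. ten digits, a dot, then five digits.
TIMESTAMP_RE = re.compile(r"\d{10}\.\d{5}")


def malformed_put_timestamps(conn):
    """Return put_timestamp values that do not match the assumed format."""
    rows = conn.execute("SELECT put_timestamp FROM container_stat").fetchall()
    return [
        value
        for (value,) in rows
        if not (isinstance(value, str) and TIMESTAMP_RE.fullmatch(value))
    ]


# Stand-in database; a real container DB exposes container_stat as a view.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE container_stat (put_timestamp TEXT)")
conn.executemany(
    "INSERT INTO container_stat (put_timestamp) VALUES (?)",
    [("1478133529.16340",), ("147:133539&16340",)],
)
bad = malformed_put_timestamps(conn)
print(bad)
```

A check along these lines could run before a database is handed to replication, though where it would fit in swift is left open here.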
