OpenStack Object Storage (swift)

Bug #1068426
Comment #3

Comment 3 for bug 1068426

Revision history for this message

Samuel Merritt (torgomatic) wrote on 2012-11-21:

This bug confuses me. (This is not a difficult task.)

I've spent some time reading the container-sync code, and here's how I think it works:

Each container DB has two sync-points SP1 and SP2. They're stored as "x_container_sync_point1" and "x_container_sync_point2", but I'm going to call them SP1 and SP2 to save keystrokes.

Initially, SP1 = SP2 = -1.

When container sync runs, the first thing it does is figure out which replica of the container it has, and names that "ordinal". If there are 3 replicas, then ordinal is either 0, 1, or 2. Each container-sync process will run on a different machine, so no two container-sync processes will ever have the same container *and* the same ordinal together (i.e. the same (container, ordinal)) pair.

Then, for every row where SP2 < row-id < SP1 and hash(container) % 3 != ordinal, it syncs that row to the remote server. That not-equals is kind of weird at first glance, but the effect is that it syncs up the rows that it didn't do last time. After it's done here, SP2 = SP1.

The third and final thing the container-server does is sync rows where SP1 < row-id and hash(object) % 3 == ordinal. After it's done, SP1 is advanced to be the newest row-id.

So essentially there are 3 categories of row: new rows, old rows, and ancient rows.

                 SP2 SP1
                  | |
  0 --------------+------------+--------- newest-row
       ancient old new

The first thing that happens is to sync old rows that this process is *not* responsible for (hash(object) != ordinal):

                              SP2
                              SP1
                               |
  0 ---------------------------+--------- newest-row
            ancient new

Then we sync new rows that this process *is* responsible for (hash(object) == ordinal):

                              SP2 SP1
                               | |
  0 ---------------------------+--------+ newest-row
            ancient old

And of course, as container contents change, new rows get appended on the right:

                              SP2 SP1
                               | |
  0 ---------------------------+--------+----- newest-row
            ancient old new

I wrote all that mainly so you or anyone else can tell me where I've misunderstood things.

Without knowing much about your setup, I suspect you've got a small number of containers with sync enabled. If there were a large number of sync-enabled containers, then different sync processes would tend to end up working on different containers at a given time by virtue of how the containers are distributed across machines.

When you get collisions, I'm guessing that the duplicate work you see is the old-row syncing. If the new-row syncing takes a long time for sync-process 1 but not for sync-process 2, then sync-process 2 will run through its new work quickly, and then on the next cycle will start syncing old rows that processes 1 and 3 should have done. It'll race through the already-synced stuff pretty quickly and then reach the rows that sync-process 1 is working on, and that's when you get duplicate work.

At large scale, the wide distribution of containers and relatively large numbers of synced containers will reduce the amount of duplicate work to a very small fraction of the overall work. At small scale, there'll be more collisions. In the limit as scale gets really small, if there's only one sync-enabled container in the whole cluster, then you'll end up with duplicate work all the time.

I don't have any good ideas on how to fix this right now. Maybe someone who understands the problem better can chime in.

This bug confuses me. (This is not a difficult task.)

I've spent some time reading the container-sync code, and here's how I think it works:

Each container DB has two sync-points SP1 and SP2. They're stored as "x_container_sync_point1" and "x_container_sync_point2", but I'm going to call them SP1 and SP2 to save keystrokes.

Initially, SP1 = SP2 = -1.

The third and final thing the container-server does is sync rows where SP1 < row-id and hash(object) % 3 == ordinal. After it's done, SP1 is advanced to be the newest row-id.

So essentially there are 3 categories of row: new rows, old rows, and ancient rows.

SP2          SP1
                  |            |
  0 --------------+------------+--------- newest-row
       ancient         old          new

The first thing that happens is to sync old rows that this process is *not* responsible for (hash(object) != ordinal):

SP2
                              SP1
                               |
  0 ---------------------------+--------- newest-row
            ancient                new

Then we sync new rows that this process *is* responsible for (hash(object) == ordinal):

SP2      SP1
                               |        |
  0 ---------------------------+--------+ newest-row
            ancient                old

And of course, as container contents change, new rows get appended on the right:

SP2      SP1
                               |        |
  0 ---------------------------+--------+----- newest-row
            ancient                old    new

I wrote all that mainly so you or anyone else can tell me where I've misunderstood things.

I don't have any good ideas on how to fix this right now. Maybe someone who understands the problem better can chime in.