LockError on blob cache

Bug #1038167 reported by Paul Everitt
This bug affects 1 person
Affects: KARL3
Status: Won't Fix
Importance: Medium
Assigned to: Tres Seaver

Bug Description

 Wed Aug 15 06:21:30 2012 ERROR karl Error locking file /srv/osfkarl/deploy/6/var/blob_cache/osf/128/.lock; pid=UNKNOWN
Details

Error locking file /srv/osfkarl/deploy/6/var/blob_cache/osf/128/.lock; pid=UNKNOWN

Traceback (most recent call last):
  File "/srv/osfkarl/deploy/6/eggs/zc.lockfile-1.0.0-py2.6.egg/zc/lockfile/__init__.py", line 76, in __init__
    _lock_file(fp)
  File "/srv/osfkarl/deploy/6/eggs/zc.lockfile-1.0.0-py2.6.egg/zc/lockfile/__init__.py", line 59, in _lock_file
    raise LockError("Couldn't lock %r" % file.name)
LockError: Couldn't lock '/srv/osfkarl/deploy/6/var/blob_cache/osf/128/.lock'

Chris wrote:

Whatever it was seems to be done now. There are a lot more locking errors in the log, but they have to do with the lock used by 'check_size' to prune the blob_cache. I think what's going on, though, is that we have multiple processes trying to use the same blob cache, and we probably shouldn't be doing that. It worked earlier when everything was on one box: all processes used the blob files owned by the database itself, and I guess it was anticipated that multiple processes would be accessing that pile. With the database server over on another box we use a non-shared blob_cache, meaning the database server has its pile and the client just has a smaller local cache. It appears, though, that this cache wasn't intended to be used by more than one process at a time, so you occasionally see lock errors, although this is the first time I've seen one for something besides the check_size thing, which is itself pretty ancillary.

So the long and the short of it is that, to be entirely correct in our usage, we probably need each process (both webapp procs, mailin, gsa_sync, etc...) to use its own blob cache so we can avoid locking errors. And of course we'll need to make sure the blob_caches are cleaned up when a process exits, or else they'll stack up and eat our disk.

Fortunately, this seems to be a pretty rare occurrence, so it's probably not super time critical. We have, after all, been using non-shared blob caches since our switch to gocept.

Ok. One thing I notice is that there isn't a stack trace, which means it isn't being reported by our exception capturing stuff. It's just something, somewhere, using log.error(...). This means that whatever is going on, it may not have an end-user impact. Regardless, though, it's obviously annoying if it's going to keep tripping the alarm.

Tags: operations
JimPGlenn (jpglenn09)
Changed in karl3:
milestone: none → m117
JimPGlenn (jpglenn09)
Changed in karl3:
milestone: m117 → m118
JimPGlenn (jpglenn09)
Changed in karl3:
milestone: m118 → m119
Revision history for this message
Paul Everitt (paul-agendaless) wrote :

We need to slow everything down until we get clarity on the Q4 budget.

Changed in karl3:
milestone: m119 → m120
Changed in karl3:
milestone: m120 → m122
JimPGlenn (jpglenn09)
Changed in karl3:
milestone: m122 → m123
Revision history for this message
Paul Everitt (paul-agendaless) wrote :

After some discussion, this one is too hard to fix.

Changed in karl3:
assignee: Chris Rossi (chris-archimedeanco) → nobody
milestone: m123 → m999
status: New → Won't Fix
Revision history for this message
Alexander Bittner (yb) wrote :

Hi!

Since the beginning of this year, we have been affected by this bug in our MultiKARL setup, which currently serves three customers. We see exactly the same traceback in our error monitor multiple times a day. However, it only shows up for one customer, especially during the morning hours, likely when people start working :-)

Fri Feb 8 09:53:30 2013 ERROR karl Error locking file
Error locking file /srv/multikarl/production/14/var/blob_cache/oxfam/259/.lock; pid=UNKNOWN

Traceback (most recent call last):
  File "/srv/multikarl/production/14/eggs/zc.lockfile-1.0.0-py2.6.egg/zc/lockfile/__init__.py", line 76, in __init__
    _lock_file(fp)
  File "/srv/multikarl/production/14/eggs/zc.lockfile-1.0.0-py2.6.egg/zc/lockfile/__init__.py", line 59, in _lock_file
    raise LockError("Couldn't lock %r" % file.name)
LockError: Couldn't lock '/srv/multikarl/production/14/var/blob_cache/oxfam/259/.lock'

What is interesting about this is that when we see the locking error on a given day, it always appears multiple times and is always about the same object (259 here) with exactly the same timestamp. In the example above, we see six occurrences with the timestamp 09:53:30.

As a note, we have not yet received any customer complaints that anything is not working.

Since this bug was filed last August, I'm not sure which of the aspects Chris describes are still the case with respect to our current setup. Chris knows our setup and the processes involved quite well, so maybe he can re-evaluate his findings from August, relating them to our current setup and to the fact that we now see this bug in our error monitor.

If it turns out that there is nothing we can do about this and that it has no impact on user experience, we'd be happy to have this error logged at a lower level (INFO or WARNING), just to avoid generating alarms.

Best regards,
Alex

Revision history for this message
Chris Rossi (chris-archimedeanco) wrote :

I did later do some more research that doesn't seem to have made it into the comment thread in this ticket. The essential problem is that there isn't a blob implementation that fits our case very well. When you create a ZODB instance with blob storage, you tell it whether or not you want a "shared" blob storage. "Shared" means that all blobs live in a single master folder shared by all processes; no blobs are stored in the database instance itself. Non-shared means that blobs live in the database instance, are retrieved over a network connection, and are stored locally in a cache which is neither complete nor definitive. The code for this mode assumes that only one process will ever access the cache.

The problem for us is that neither of these really fits our model well. Obviously, with multiple app servers and a single database server, a shared blob storage doesn't work for us. The database server must be responsible for the complete and definitive set of blobs, and our app servers need to retrieve and store them locally as needed. But since the local blob cache is built with the assumption that it will only ever be used by a single process, we hit the error above.
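
For concreteness, here is a minimal sketch of how the two modes are chosen when a ZEO client storage is constructed; the server address, paths, and sizes are illustrative, not our actual configuration:

    from ZEO.ClientStorage import ClientStorage
    from ZODB import DB

    # Non-shared mode: the ZEO server holds the definitive set of blobs and
    # blob_dir is only a local cache; ZODB assumes a single process owns it,
    # which is where the .lock files (and the LockError above) come from.
    storage = ClientStorage(
        ('zeo.example.com', 8100),           # illustrative server address
        blob_dir='var/blob_cache/osf',       # per-client cache directory
        shared_blob_dir=False,
        blob_cache_size=512 * 1024 * 1024,   # cache is pruned above this size
    )
    db = DB(storage)

    # Shared mode would instead point blob_dir at a directory visible to all
    # processes (e.g. on a network filesystem) and pass shared_blob_dir=True;
    # no local cache or pruning is involved in that mode.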

I think the ideal solution would be to "fix" ZODB such that the local blob caches don't require an exclusive lock. This, however, might be somewhat expensive in terms of programming hours, and probably even more expensive in terms of the politics of convincing the core devs that such a fix is necessary or suitable.

Another solution might be to re-engineer the clusters a bit so that there's a common networked disk that could be used as shared blob storage, since that mode is engineered for concurrency.

A more hacky solution would be something like what I suggested back in August: have each process generate its own ZODB URI that gives it its own blob cache. Some care needs to be taken to make sure these caches are removed even when a process exits abnormally. It's not the best solution in terms of performance, but it should be feasible.
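
A rough sketch of what that could look like, assuming a ZEO client; the helper name and directory layout here are made up for illustration:

    import atexit
    import os
    import shutil
    import tempfile

    from ZEO.ClientStorage import ClientStorage

    def open_storage_with_private_blob_cache(zeo_addr, parent='var/blob_caches'):
        # Each process gets a unique cache directory, so no two processes
        # ever contend for the same .lock files.
        if not os.path.isdir(parent):
            os.makedirs(parent)
        blob_dir = tempfile.mkdtemp(prefix='blob_cache-', dir=parent)

        # Best-effort cleanup on normal exit; abnormal exits still leave the
        # directory behind, so an external sweeper (tmpwatch, cron) is needed
        # to keep stale caches from eating the disk.
        atexit.register(shutil.rmtree, blob_dir, True)  # True = ignore_errors

        return ClientStorage(zeo_addr, blob_dir=blob_dir, shared_blob_dir=False)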

The last, most expedient solution is simply to stifle the ERROR-level log messages coming out of zc.lockfile. This should be possible with standard library logging configuration. We haven't been given any reason to believe the errors have an impact on end users, so this might be a suitable solution.
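
Assuming zc.lockfile logs under the logger name 'zc.lockfile' (which matches the messages above), something like this at process startup, or the equivalent stanza in the logging config file, would do it:

    import logging

    # Raise the threshold for zc.lockfile's logger so its ERROR-level
    # "Error locking file ..." messages no longer reach the alarm handler.
    logging.getLogger('zc.lockfile').setLevel(logging.CRITICAL)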

I suspect, but have not confirmed, that the source of these error messages is actually the thread used by ZODB to prune blob caches to keep them under the maximum size. If that is a correct hypothesis, it may be possible to simply turn off the max limit and use an external script to limit the size of the blob cache.
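
Such a script could be as simple as the following sketch, which deletes the least recently used *.blob files until a cache directory is back under a size budget; the path and the 512 MB budget are illustrative only:

    import os

    def prune_blob_cache(cache_dir, max_bytes):
        entries = []
        total = 0
        for dirpath, dirnames, filenames in os.walk(cache_dir):
            for name in filenames:
                if not name.endswith('.blob'):
                    continue                  # leave .lock and layout files alone
                path = os.path.join(dirpath, name)
                try:
                    st = os.stat(path)
                except OSError:
                    continue                  # file vanished under us; fine
                entries.append((st.st_atime, st.st_size, path))
                total += st.st_size

        entries.sort()                        # oldest access time first
        for atime, size, path in entries:
            if total <= max_bytes:
                break
            try:
                os.remove(path)
                total -= size
            except OSError:
                pass                          # ignore races with running clients

    if __name__ == '__main__':
        prune_blob_cache('/srv/osfkarl/deploy/6/var/blob_cache/osf',
                         512 * 1024 * 1024)   # 512 MB budget, illustrative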

It may be a matter of deciding which solution is the least objectionable. Let me know what you think.

Revision history for this message
Christian Theune (ctheune) wrote :

All options would be OK. I do like to avoid NFS though, but we are using it under some circumstances.

External pruning is fine - the ZODB cache implementation actually expects that external processes might delete cache files.

We use tmpwatch for this.

Setting up multiple cache directories (one for each client process) would be fine as well - the tmpwatch needs to be adjusted to keep the size manageable.

I'll talk with Alex personally about what our choice would be regarding the specific support ticket.

Revision history for this message
Paul Everitt (paul-agendaless) wrote : Re: [Bug 1038167] LockError on blob cache

Good analysis, Chris. I saw gocept respond, agreeing with what Tres and Chris told me: that they need to get the configuration set up right. I'll assign the ticket to them.

--Paul

Revision history for this message
Paul Everitt (paul-agendaless) wrote :

Adding to M125 and assigning to Alex based on Christian's comment.

Changed in karl3:
assignee: nobody → Alexander Bittner (yb)
status: Won't Fix → Confirmed
milestone: m999 → m125
Revision history for this message
Paul Everitt (paul-agendaless) wrote :

Christian wrote: "I'll talk with Alex personally about what our choice would be regarding the specific support ticket."

Alex, did Christian have a chance to talk with you about this?

Revision history for this message
Christian Theune (ctheune) wrote :

I can't even remember whether I talked to Alex. Unfortunately he's currently sick. I'll pick this up again after Easter, when he'll be back.

Changed in karl3:
milestone: m125 → m126
Revision history for this message
Christian Theune (ctheune) wrote :

I was able to review this with Alex today.

I think Chris Rossi's analysis isn't 100% correct: the ZEO cache layout assumes that other agents can operate on the directory at the same time and has appropriate locking and waiting behaviour. I can see that relstorage also implements the same protocol.

From my perspective we're tracking down a bug here, so I think we should

* identify the cause of this traceback (I don't understand which part of the application causes the traceback)
* fix the bug
* stay with a single directory
* add external tmpwatch

Revision history for this message
Paul Everitt (paul-agendaless) wrote :

Analysis from gocept complete, over to us to work on later in the month.

Changed in karl3:
assignee: Alexander Bittner (yb) → Tres Seaver (tseaver)
Revision history for this message
Paul Everitt (paul-agendaless) wrote :

This week or next, Tres can look at comment 10 and give his input.

Tres Seaver (tseaver)
Changed in karl3:
status: Confirmed → Won't Fix