Comment 6 for bug 1038167

Paul Everitt (paul-agendaless) wrote : Re: [Bug 1038167] LockError on blob cache

Good analysis, Chris. I saw gocept respond, agreeing with what Tres and Chris told me: they need to get the configuration set up right. I'll assign the ticket to them.

--Paul

On Mar 4, 2013, at 10:46 AM, Chris Rossi <email address hidden> wrote:

> I later did some more research that doesn't seem to have made it into
> the comment thread in this ticket. The essential problem is that there
> isn't a blob storage option that matches our case very well. When you
> create a ZODB instance with blob storage, you tell it whether or not
> you want a "shared" blob storage. "Shared" means that all blobs live
> in a single master folder shared by all processes; no blobs are stored
> in the database instance itself. Non-shared means that blobs live in a
> single database instance, are retrieved over a network connection, and
> are stored locally in a cache which is neither complete nor
> definitive. The code for this mode assumes that only one process will
> ever access that cache.
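>
> For reference, the distinction shows up in the ClientStorage
> constructor. Here's a minimal illustration (the address and paths are
> made up, not our actual configuration):
>
>     from ZEO.ClientStorage import ClientStorage
>     from ZODB import DB
>
>     # Non-shared: blobs are fetched from the ZEO server on demand and
>     # kept in a local cache that ClientStorage assumes it owns
>     # exclusively (this is the mode we run in).
>     storage = ClientStorage(
>         ('dbserver', 8100),
>         blob_dir='var/blob_cache/osf',
>         shared_blob_dir=False,
>     )
>
>     # Shared: every client reads and writes one master blob directory
>     # (e.g. on a network filesystem); nothing is cached locally.
>     #   ClientStorage(('dbserver', 8100),
>     #                 blob_dir='/mnt/shared/blobs',
>     #                 shared_blob_dir=True)
>
>     db = DB(storage)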
>
> The problem for us is that neither of these really fits our model.
> Obviously, with multiple app servers and a single database server, a
> shared blob storage doesn't work for us. The database server must be
> responsible for the complete and definitive set of blobs, and our app
> servers need to retrieve and store them locally as needed. But since
> the local blob cache is built on the assumption that it will only ever
> be used by a single process, we hit the error above.
>
> I think the ideal solution would be to "fix" ZODB so that the local
> blob caches don't require an exclusive lock. This, however, might be
> somewhat expensive in terms of programming hours, and probably even
> more expensive in terms of the politics of convincing the core devs
> that such a fix is necessary or suitable.
>
> Another solution might be to re-engineer the clusters a bit so that
> there's a common networked disk that could be used as a shared blob
> storage, since that mode is engineered for concurrent access.
>
> A hackier solution would be something like what I suggested back in
> August: have each process generate its own ZODB URI that gives it its
> own blob cache. Some care needs to be taken to make sure these caches
> are removed even when a process exits abnormally. It's not the best
> solution in terms of performance, but it should be feasible.
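>
> Roughly what I have in mind, as an untested sketch (the helper name
> and base directory are made up):
>
>     import atexit
>     import os
>     import shutil
>     import tempfile
>
>     def per_process_blob_cache(base='var/blob_cache'):
>         """Create a blob cache directory unique to this process and
>         register cleanup for a normal exit (base must already exist)."""
>         cache_dir = tempfile.mkdtemp(prefix='blobs-%d-' % os.getpid(),
>                                      dir=base)
>         atexit.register(shutil.rmtree, cache_dir, True)
>         return cache_dir
>
> atexit won't fire if a process is killed hard, so we'd still want a
> cron job or startup step that sweeps stale per-process directories.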
>
> The last, most expedient solution is simply to stifle the ERROR level
> log messages coming out of zc.lockfile. This should be possible with
> standard library logging configuration. We haven't been given any
> reason to believe the errors have an impact on end users, so this
> might be a suitable solution.
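>
> Something along these lines in our logging setup ought to do it,
> assuming we raise the threshold on the zc.lockfile module logger (the
> same thing could be expressed in the .ini logging sections):
>
>     import logging
>
>     # zc.lockfile logs LockError at ERROR level on its module logger;
>     # raising the threshold hides it without touching ZODB itself.
>     logging.getLogger('zc.lockfile').setLevel(logging.CRITICAL)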
>
> I suspect, but have not confirmed, that the source of these error
> messages is actually the thread used by ZODB to prune blob caches to
> keep them under the maximum size. If that is a correct hypothesis, it
> may be possible to simply turn off the max limit and use an external
> script to limit the size of the blob cache.
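>
> If so, an external pruner could be as simple as the sketch below
> (untested; meant to be run from cron against the cache directory):
>
>     import os
>
>     def prune_blob_cache(cache_dir, max_bytes):
>         """Delete least-recently-accessed .blob files until the cache
>         is under max_bytes."""
>         entries, total = [], 0
>         for dirpath, dirnames, filenames in os.walk(cache_dir):
>             for name in filenames:
>                 if not name.endswith('.blob'):
>                     continue
>                 path = os.path.join(dirpath, name)
>                 st = os.stat(path)
>                 entries.append((st.st_atime, st.st_size, path))
>                 total += st.st_size
>         entries.sort()  # oldest access time first
>         for atime, size, path in entries:
>             if total <= max_bytes:
>                 break
>             os.remove(path)
>             total -= size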
>
> It may be a matter of deciding which solution is the least
> objectionable. Let me know what you think.
>
> --
> You received this bug notification because you are subscribed to the bug
> report.
> https://bugs.launchpad.net/bugs/1038167
>
> Title:
> LockError on blob cache
>
> Status in KARL3:
> Won't Fix
>
> Bug description:
> Wed Aug 15 06:21:30 2012 ERROR karl Error locking file /srv/osfkarl/deploy/6/var/blob_cache/osf/128/.lock; pid=UNKNOWN
>
> Traceback (most recent call last):
>   File "/srv/osfkarl/deploy/6/eggs/zc.lockfile-1.0.0-py2.6.egg/zc/lockfile/__init__.py", line 76, in __init__
>     _lock_file(fp)
>   File "/srv/osfkarl/deploy/6/eggs/zc.lockfile-1.0.0-py2.6.egg/zc/lockfile/__init__.py", line 59, in _lock_file
>     raise LockError("Couldn't lock %r" % file.name)
> LockError: Couldn't lock '/srv/osfkarl/deploy/6/var/blob_cache/osf/128/.lock'
>
>
> Chris wrote:
>
> Whatever it is seems to be done now. There are a lot more locking
> errors in the log, but they have to do with the lock used by
> 'check_size' to prune the blob_cache. I think what's going on,
> though, is that we have multiple processes trying to use the same
> blob cache, and we probably shouldn't really be doing that. It worked
> earlier when everything was on one box: all processes used the blob
> files belonging to the database itself, and I guess it was anticipated
> that multiple processes would be accessing that pile. With the server
> over on another box we use a non-shared blob_cache, meaning the
> database server has its pile and the client just has a smaller local
> cache. It appears, though, that the cache wasn't intended to be used
> by more than one process at a time, so you occasionally see lock
> errors, although this is the first time I've seen it for something
> besides the check_size thing, which is itself pretty ancillary.
>
> So the long and the short of it is that, to be entirely correct in
> our usage, we probably need each process (both webapp procs, mailin,
> gsa_sync, etc...) to use its own blob cache so we can avoid locking
> errors. And of course, we'll need to make sure the blob_caches are
> cleaned up when a process exits, or else they'll stack up and eat our
> disk.
>
> Fortunately, this seems to be a pretty rare occurrence, so it's
> probably not super time critical. We have, after all, been using
> non-shared blob caches since our switch to gocept.
>
> Ok. One thing I notice is that there isn't a stack trace, which means
> it isn't being reported by our exception capturing stuff. It's just
> something, somewhere, using log.error(...). This means that whatever
> is going on may not have an end-user impact. Regardless, though, it's
> obviously annoying if it's going to keep tripping the alarm.
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/karl3/+bug/1038167/+subscriptions