Comment 6 for bug 1038167

Paul Everitt (paul-agendaless) wrote : Re: [Bug 1038167] LockError on blob cache

Good analysis, Chris. I saw gocept respond, agreeing with what Tres and Chris told me: they need to get the configuration set up right. I'll assign the ticket to them.

--Paul

On Mar 4, 2013, at 10:46 AM, Chris Rossi <email address hidden> wrote:

> I later did some more research that doesn't seem to have made it into
> the comment thread in this ticket. The essential problem is that there
> isn't a blob storage option that matches our case very well. When you
> create a ZODB instance with blob storage, you tell it whether or not
> you want a "shared" blob storage. "Shared" means that all blobs live
> in a single master folder shared by all processes; no blobs are stored
> in the database instance itself. Non-shared means that blobs live in a
> single database instance, are retrieved over a network connection, and
> are stored locally in a cache which is neither complete nor
> definitive. The code for this mode assumes that only one process will
> ever access that cache.
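>
> For reference, the distinction shows up in the ClientStorage
> constructor. Here's a minimal illustration (the address and paths are
> made up, not our actual configuration):
>
>     from ZEO.ClientStorage import ClientStorage
>     from ZODB import DB
>
>     # Non-shared: blobs are fetched from the ZEO server on demand and
>     # kept in a local cache that ClientStorage assumes it owns
>     # exclusively (this is the mode we run in).
>     storage = ClientStorage(
>         ('dbserver', 8100),
>         blob_dir='var/blob_cache/osf',
>         shared_blob_dir=False,
>     )
>
>     # Shared: every client reads and writes one master blob directory
>     # (e.g. on a network filesystem); nothing is cached locally.
>     #   ClientStorage(('dbserver', 8100),
>     #                 blob_dir='/mnt/shared/blobs',
>     #                 shared_blob_dir=True)
>
>     db = DB(storage)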
>
> The problem for us is that neither of these really fits our model.
> Obviously, with multiple app servers and a single database server, a
> shared blob storage doesn't work for us. The database server must be
> responsible for the complete and definitive set of blobs, and our app
> servers need to retrieve and store them locally as needed. But since
> the local blob cache is built on the assumption that it will only ever
> be used by a single process, we hit the error above.
>
> I think the ideal solution would be to "fix" ZODB so that the local
> blob caches don't require an exclusive lock. This, however, might be
> somewhat expensive in terms of programming hours, and probably even
> more expensive in terms of the politics of convincing the core devs
> that such a fix is necessary or suitable.
>
> Another solution might be to re-engineer the clusters a bit so that
> there's a common networked disk that could be used as a shared blob
> storage, since that mode is engineered for concurrent access.
>
> A hackier solution would be something like what I suggested back in
> August: have each process generate its own ZODB URI that gives it its
> own blob cache. Some care needs to be taken to make sure these caches
> are removed even when a process exits abnormally. It's not the best
> solution in terms of performance, but it should be feasible.
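>
> Roughly what I have in mind, as an untested sketch (the helper name
> and base directory are made up):
>
>     import atexit
>     import os
>     import shutil
>     import tempfile
>
>     def per_process_blob_cache(base='var/blob_cache'):
>         """Create a blob cache directory unique to this process and
>         register cleanup for a normal exit (base must already exist)."""
>         cache_dir = tempfile.mkdtemp(prefix='blobs-%d-' % os.getpid(),
>                                      dir=base)
>         atexit.register(shutil.rmtree, cache_dir, True)
>         return cache_dir
>
> atexit won't fire if a process is killed hard, so we'd still want a
> cron job or startup step that sweeps stale per-process directories.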
>
> The last, most expedient solution is simply to stifle the ERROR level
> log messages coming out of zc.lockfile. This should be possible with
> standard library logging configuration. We haven't been given any
> reason to believe the errors have an impact on end users, so this
> might be a suitable solution.
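>
> Something along these lines in our logging setup ought to do it,
> assuming we raise the threshold on the zc.lockfile module logger (the
> same thing could be expressed in the .ini logging sections):
>
>     import logging
>
>     # zc.lockfile logs LockError at ERROR level on its module logger;
>     # raising the threshold hides it without touching ZODB itself.
>     logging.getLogger('zc.lockfile').setLevel(logging.CRITICAL)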
>
> I suspect, but have not confirmed, that the source of these error
> messages is actually the thread used by ZODB to prune blob caches to
> keep them under the maximum size. If that is a correct hypothesis, it
> may be possible to simply turn off the max limit and use an external
> script to limit the size of the blob cache.
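>
> If so, an external pruner could be as simple as the sketch below
> (untested; meant to be run from cron against the cache directory):
>
>     import os
>
>     def prune_blob_cache(cache_dir, max_bytes):
>         """Delete least-recently-accessed .blob files until the cache
>         is under max_bytes."""
>         entries, total = [], 0
>         for dirpath, dirnames, filenames in os.walk(cache_dir):
>             for name in filenames:
>                 if not name.endswith('.blob'):
>                     continue
>                 path = os.path.join(dirpath, name)
>                 st = os.stat(path)
>                 entries.append((st.st_atime, st.st_size, path))
>                 total += st.st_size
>         entries.sort()  # oldest access time first
>         for atime, size, path in entries:
>             if total <= max_bytes:
>                 break
>             os.remove(path)
>             total -= size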
>
> It may be a matter of deciding which solution is the least
> objectionable. Let me know what you think.
>
> --
> You received this bug notification because you are subscribed to the bug
> report.
> https://bugs.launchpad.net/bugs/1038167
>
> Title:
> LockError on blob cache
>
> Status in KARL3:
> Won't Fix
>
> Bug description:
> Wed Aug 15 06:21:30 2012 ERROR karl Error locking file /srv/osfkarl/deploy/6/var/blob_cache/osf/128/.lock; pid=UNKNOWN
>
> Traceback (most recent call last):
>   File "/srv/osfkarl/deploy/6/eggs/zc.lockfile-1.0.0-py2.6.egg/zc/lockfile/__init__.py", line 76, in __init__
>     _lock_file(fp)
>   File "/srv/osfkarl/deploy/6/eggs/zc.lockfile-1.0.0-py2.6.egg/zc/lockfile/__init__.py", line 59, in _lock_file
>     raise LockError("Couldn't lock %r" % file.name)
> LockError: Couldn't lock '/srv/osfkarl/deploy/6/var/blob_cache/osf/128/.lock'
>
>
> Chris wrote:
>
> Whatever it is seems to be done now. There are a lot more locking
> errors in the log, but they have to do with the lock used by
> 'check_size' to prune the blob_cache. I think what's going on,
> though, is that we have multiple processes trying to use the same
> blob cache, and we probably shouldn't really be doing that. It worked
> earlier when everything was on one box: all processes used the blob
> files belonging to the database itself, and I guess it was anticipated
> that multiple processes would be accessing that pile. With the server
> over on another box we use a non-shared blob_cache, meaning the
> database server has its pile and the client just has a smaller local
> cache. It appears, though, that the cache wasn't intended to be used
> by more than one process at a time, so you occasionally see lock
> errors, although this is the first time I've seen it for something
> besides the check_size thing, which is itself pretty ancillary.
>
> So the long and the short of it is that, to be entirely correct in
> our usage, we probably need each process (both webapp procs, mailin,
> gsa_sync, etc...) to use its own blob cache so we can avoid locking
> errors. And of course, we'll need to make sure the blob_caches are
> cleaned up when a process exits, or else they'll stack up and eat our
> disk.
>
> Fortunately, this seems to be a pretty rare occurrence, so it's
> probably not super time critical. We have, after all, been using
> non-shared blob caches since our switch to gocept.
>
> Ok. One thing I notice is that there isn't a stack trace, which means
> it isn't being reported by our exception capturing stuff. It's just
> something, somewhere, using log.error(...). This means that whatever
> is going on may not have an end-user impact. Regardless, though, it's
> obviously annoying if it's going to keep tripping the alarm.
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/karl3/+bug/1038167/+subscriptions