LockError on blob cache

Bug #1038167 reported by Paul Everitt
This bug affects 1 person
Affects: KARL3
Status: Won't Fix
Importance: Medium
Assigned to: Tres Seaver

Bug Description

 Wed Aug 15 06:21:30 2012 ERROR karl Error locking file /srv/osfkarl/deploy/6/var/blob_cache/osf/128/.lock; pid=UNKNOWN
Details

Error locking file /srv/osfkarl/deploy/6/var/blob_cache/osf/128/.lock; pid=UNKNOWN

Traceback (most recent call last):
  File "/srv/osfkarl/deploy/6/eggs/zc.lockfile-1.0.0-py2.6.egg/zc/lockfile/__init__.py", line 76, in __init__
    _lock_file(fp)
  File "/srv/osfkarl/deploy/6/eggs/zc.lockfile-1.0.0-py2.6.egg/zc/lockfile/__init__.py", line 59, in _lock_file
    raise LockError("Couldn't lock %r" % file.name)
LockError: Couldn't lock '/srv/osfkarl/deploy/6/var/blob_cache/osf/128/.lock'

Chris wrote:

Whatever it was seems to be done now. There are a lot more locking errors in the log, but they have to do with the lock used by 'check_size' to prune the blob_cache. I think what's going on, though, is that we have multiple processes trying to use the same blob cache, and we probably shouldn't be doing that. It worked earlier when everything was on one box: all processes used the blob files owned by the database itself, and I guess it was anticipated that multiple processes would be accessing that pile. With the database server over on another box we use a non-shared blob_cache, meaning the database server has its pile and the client just has a smaller local cache. It appears, though, that this cache wasn't intended to be used by more than one process at a time, so you occasionally see lock errors, although this is the first time I've seen one for something besides the check_size thing, which is itself pretty ancillary.

So the long and the short of it is that, to be entirely correct in our usage, we probably need each process (both webapp procs, mailin, gsa_sync, etc...) to use its own blob cache so we can avoid locking errors. And of course we'll need to make sure the blob_caches are cleaned up when a process exits, or else they'll stack up and eat our disk.

Fortunately, this seems to be a pretty rare occurrence, so it's probably not super time critical. We have, after all, been using non-shared blob caches since our switch to gocept.

Ok. One thing I notice is that there isn't a stack trace, which means it isn't being reported by our exception capturing stuff. It's just something, somewhere, using log.error(...). This means that whatever is going on, it may not have an end-user impact. Regardless, though, it's obviously annoying if it's going to keep tripping the alarm.

Tags: operations
JimPGlenn (jpglenn09)
Changed in karl3:
milestone: none → m117
JimPGlenn (jpglenn09)
Changed in karl3:
milestone: m117 → m118
JimPGlenn (jpglenn09)
Changed in karl3:
milestone: m118 → m119
Revision history for this message
Paul Everitt (paul-agendaless) wrote :

We need to slow everything down until we get clarity on the Q4 budget.

Changed in karl3:
milestone: m119 → m120
Changed in karl3:
milestone: m120 → m122
JimPGlenn (jpglenn09)
Changed in karl3:
milestone: m122 → m123
Revision history for this message
Paul Everitt (paul-agendaless) wrote :

After some discussion, this one is too hard to fix.

Changed in karl3:
assignee: Chris Rossi (chris-archimedeanco) → nobody
milestone: m123 → m999
status: New → Won't Fix
Revision history for this message
Alexander Bittner (yb) wrote :

Hi!

Since the beginning of this year, we have been affected by this bug in our MultiKARL setup, which currently serves three customers. We see exactly the same traceback in our error monitor multiple times a day. However, it only shows up for one customer, especially during the morning hours, likely when people start working :-)

Fri Feb 8 09:53:30 2013 ERROR karl Error locking file
Error locking file /srv/multikarl/production/14/var/blob_cache/oxfam/259/.lock; pid=UNKNOWN

Traceback (most recent call last):
  File "/srv/multikarl/production/14/eggs/zc.lockfile-1.0.0-py2.6.egg/zc/lockfile/__init__.py", line 76, in __init__
    _lock_file(fp)
  File "/srv/multikarl/production/14/eggs/zc.lockfile-1.0.0-py2.6.egg/zc/lockfile/__init__.py", line 59, in _lock_file
    raise LockError("Couldn't lock %r" % file.name)
LockError: Couldn't lock '/srv/multikarl/production/14/var/blob_cache/oxfam/259/.lock'

What is interesting about this is that when we see the locking error on a given day, it always appears multiple times and is always about the same object (259 here) with exactly the same timestamp. In the example above, we see six occurrences with the timestamp 09:53:30.

As a note, we have not yet received any customer complaints that anything is not working.

Since this bug was filed last August, I'm not sure which of the aspects Chris describes are still the case with respect to our current setup. Chris knows our setup and the processes involved quite well, so maybe he can re-evaluate his findings from August, relating them to our current setup and to the fact that we now see this bug in our error monitor.

If it turns out that there is nothing we can do about this and that it has no impact on user experience, we'd be happy to have this error logged at a lower level (INFO or WARNING), just to avoid generating alarms.

Best regards,
Alex

Revision history for this message
Chris Rossi (chris-archimedeanco) wrote :

I did later do some more research that doesn't seem to have made it into the comment thread in this ticket. The essential problem is that there isn't a blob implementation that fits our case very well. When you create a ZODB instance with blob storage, you tell it whether or not you want a "shared" blob storage. "Shared" means that all blobs live in a single master folder shared by all processes; no blobs are stored in the database instance itself. Non-shared means that blobs live in the database instance, are retrieved over a network connection, and are stored locally in a cache which is neither complete nor definitive. The code for this mode assumes that only one process will ever access the cache.

The problem for us is that neither of these really fits our model well. Obviously, with multiple app servers and a single database server, a shared blob storage doesn't work for us. The database server must be responsible for the complete and definitive set of blobs, and our app servers need to retrieve and store them locally as needed. But since the local blob cache is built with the assumption that it will only ever be used by a single process, we hit the error above.
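
For concreteness, here is a minimal sketch of how the two modes are chosen when a ZEO client storage is constructed; the server address, paths, and sizes are illustrative, not our actual configuration:

    from ZEO.ClientStorage import ClientStorage
    from ZODB import DB

    # Non-shared mode: the ZEO server holds the definitive set of blobs and
    # blob_dir is only a local cache; ZODB assumes a single process owns it,
    # which is where the .lock files (and the LockError above) come from.
    storage = ClientStorage(
        ('zeo.example.com', 8100),           # illustrative server address
        blob_dir='var/blob_cache/osf',       # per-client cache directory
        shared_blob_dir=False,
        blob_cache_size=512 * 1024 * 1024,   # cache is pruned above this size
    )
    db = DB(storage)

    # Shared mode would instead point blob_dir at a directory visible to all
    # processes (e.g. on a network filesystem) and pass shared_blob_dir=True;
    # no local cache or pruning is involved in that mode.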

I think the ideal solution would be to "fix" ZODB such that the local blob caches don't require an exclusive lock. This, however, might be somewhat expensive in terms of programming hours, and probably even more expensive in terms of the politics of convincing the core devs that such a fix is necessary or suitable.

Another solution might be to re-engineer the clusters a bit so that there's a common networked disk that could be used as shared blob storage, since that mode is engineered for concurrency.

A more hacky solution would be something like what I suggested back in August: have each process generate its own ZODB URI that gives it its own blob cache. Some care needs to be taken to make sure these caches are removed even when a process exits abnormally. It's not the best solution in terms of performance, but it should be feasible.
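
A rough sketch of what that could look like, assuming a ZEO client; the helper name and directory layout here are made up for illustration:

    import atexit
    import os
    import shutil
    import tempfile

    from ZEO.ClientStorage import ClientStorage

    def open_storage_with_private_blob_cache(zeo_addr, parent='var/blob_caches'):
        # Each process gets a unique cache directory, so no two processes
        # ever contend for the same .lock files.
        if not os.path.isdir(parent):
            os.makedirs(parent)
        blob_dir = tempfile.mkdtemp(prefix='blob_cache-', dir=parent)

        # Best-effort cleanup on normal exit; abnormal exits still leave the
        # directory behind, so an external sweeper (tmpwatch, cron) is needed
        # to keep stale caches from eating the disk.
        atexit.register(shutil.rmtree, blob_dir, True)  # True = ignore_errors

        return ClientStorage(zeo_addr, blob_dir=blob_dir, shared_blob_dir=False)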

The last, most expedient solution is simply to stifle the ERROR-level log messages coming out of zc.lockfile. This should be possible with standard library logging configuration. We haven't been given any reason to believe the errors have an impact on end users, so this might be a suitable solution.
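
Assuming zc.lockfile logs under the logger name 'zc.lockfile' (which matches the messages above), something like this at process startup, or the equivalent stanza in the logging config file, would do it:

    import logging

    # Raise the threshold for zc.lockfile's logger so its ERROR-level
    # "Error locking file ..." messages no longer reach the alarm handler.
    logging.getLogger('zc.lockfile').setLevel(logging.CRITICAL)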

I suspect, but have not confirmed, that the source of these error messages is actually the thread used by ZODB to prune blob caches to keep them under the maximum size. If that is a correct hypothesis, it may be possible to simply turn off the max limit and use an external script to limit the size of the blob cache.
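
Such a script could be as simple as the following sketch, which deletes the least recently used *.blob files until a cache directory is back under a size budget; the path and the 512 MB budget are illustrative only:

    import os

    def prune_blob_cache(cache_dir, max_bytes):
        entries = []
        total = 0
        for dirpath, dirnames, filenames in os.walk(cache_dir):
            for name in filenames:
                if not name.endswith('.blob'):
                    continue                  # leave .lock and layout files alone
                path = os.path.join(dirpath, name)
                try:
                    st = os.stat(path)
                except OSError:
                    continue                  # file vanished under us; fine
                entries.append((st.st_atime, st.st_size, path))
                total += st.st_size

        entries.sort()                        # oldest access time first
        for atime, size, path in entries:
            if total <= max_bytes:
                break
            try:
                os.remove(path)
                total -= size
            except OSError:
                pass                          # ignore races with running clients

    if __name__ == '__main__':
        prune_blob_cache('/srv/osfkarl/deploy/6/var/blob_cache/osf',
                         512 * 1024 * 1024)   # 512 MB budget, illustrative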

It may be a matter of deciding which solution is the least objectionable. Let me know what you think.

Revision history for this message
Christian Theune (ctheune) wrote :

All options would be OK. I do like to avoid NFS though, but we are using it under some circumstances.

External pruning is fine - the ZODB cache implementation actually expects that external processes might delete cache files.

We use tmpwatch for this.

Setting up multiple cache directories (one for each client process) would be fine as well - the tmpwatch needs to be adjusted to keep the size manageable.

I'll talk with Alex personally about what our choice would be regarding the specific support ticket.

Revision history for this message
Paul Everitt (paul-agendaless) wrote : Re: [Bug 1038167] LockError on blob cache

Good analysis, Chris. I saw gocept respond, agreeing with what Tres and Chris told me: that they need to get the configuration set up right. I'll assign the ticket to them.

--Paul

Revision history for this message
Paul Everitt (paul-agendaless) wrote :

Adding to M125 and assigning to Alex based on Christian's comment.

Changed in karl3:
assignee: nobody → Alexander Bittner (yb)
status: Won't Fix → Confirmed
milestone: m999 → m125
Revision history for this message
Paul Everitt (paul-agendaless) wrote :

Christian wrote: "I'll talk with Alex personally about what our choice would be regarding the specific support ticket."

Alex, did Christian have a chance to talk with you about this?

Revision history for this message
Christian Theune (ctheune) wrote :

I can't even remember whether I talked to Alex. Unfortunately he's currently sick. I'll pick this up again after Easter, when he'll be back.

Changed in karl3:
milestone: m125 → m126
Revision history for this message
Christian Theune (ctheune) wrote :

I was able to review this with Alex today.

I think Chris Rossi's analysis isn't 100% correct: the ZEO cache layout assumes that other agents can operate on the directory at the same time and has appropriate locking and waiting behaviour. I can see that relstorage also implements the same protocol.

From my perspective we're tracking down a bug here, so I think we should

* identify the cause of this traceback (I don't understand which part of the application causes the traceback)
* fix the bug
* stay with a single directory
* add external tmpwatch

Revision history for this message
Paul Everitt (paul-agendaless) wrote :

Analysis from gocept complete, over to us to work on later in the month.

Changed in karl3:
assignee: Alexander Bittner (yb) → Tres Seaver (tseaver)
Revision history for this message
Paul Everitt (paul-agendaless) wrote :

This week or next, Tres can look at comment 10 and give his input.

Tres Seaver (tseaver)
Changed in karl3:
status: Confirmed → Won't Fix