Same node handed out in response to two parallel 'acquire' requests

Bug #1382575 reported by Jason Hobbs
Affects: MAAS
Status: Fix Released
Importance: Critical
Assigned to: Jason Hobbs

Bug Description

I have a script that uses the CLI to acquire nodes in quick succession, in parallel, from two separate processes. I make 30 node acquire requests, and all of them return successfully. However, fewer than 30 unique system-ids are returned, so MAAS is sometimes handing out the same system ID in response to multiple node acquire requests.

This was initially reported by a user, and I have confirmed it on my own setup, running MAAS 1.7.0~beta6+bzr3260-0ubuntu1~trusty1.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote : Re: Same node handed out in response to two parallel 'acquire' requests from the same user

I've attached a script that reproduces the issue.

You can work around this issue by editing the first line of "/etc/maas/maas-http.conf", changing 'processes=2' to 'processes=1'.
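
For reference, that first line is Apache's mod_wsgi daemon definition. A rough sketch of what it looks like (exact options vary by release, so treat this as illustrative rather than a verbatim copy of the file):

    WSGIDaemonProcess maas user=maas group=maas processes=2 threads=1

With processes=1, a single web worker serves all API requests, which serialises the racing acquires at the cost of throughput.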

summary: - Same node handed out in two parallel 'acquire' requests from the same
- user
+ Same node handed out in response to two parallel 'acquire' requests from
+ the same user
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

This also happens when there are two users making the requests.

summary: - Same node handed out in response to two parallel 'acquire' requests from
- the same user
+ Same node handed out in response to two parallel 'acquire' requests
Changed in maas:
status: Triaged → In Progress
assignee: nobody → Jason Hobbs (jason-hobbs)
Revision history for this message
Graham Binns (gmb) wrote :

Node.acquire() doesn't commit after save(), so the state change won't be committed until the request completes. That's where the race lies. We could also perhaps take out a lock on the table when acquiring the node; Gavin, can you weigh in on that?
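
For illustration, a row-level lock is the usual Django-side variant of the lock Graham mentions. A minimal sketch, using MAAS-style names (Node, NODE_STATUS, NodeNotAvailable) purely as stand-ins, not actual MAAS code:

    from django.db import transaction

    def acquire(node, user):
        with transaction.atomic():
            # Re-read the row under a FOR UPDATE lock. A concurrent
            # acquirer blocks on this line until we commit, then sees
            # the updated status and refuses the node.
            node = Node.objects.select_for_update().get(id=node.id)
            if node.status != NODE_STATUS.READY:
                raise NodeNotAvailable(node.system_id)
            node.status = NODE_STATUS.ALLOCATED
            node.owner = user
            node.save()
        return node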

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Adding a transaction.commit() in node.acquire(), after the save, doesn't fix this, so I don't think Django's transaction commit timing is the full story.

Revision history for this message
Gavin Panella (allenap) wrote :

At the isolation level we're at — read committed — the database doesn't
protect us from concurrent updates to the same object. Two threads can
read the object at the same time, update the status and user, then save
it back, with the last one "winning".
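
Concretely, the lost update comes from a read-modify-write like this; a hypothetical sketch, not MAAS's actual acquire code:

    # Both API worker processes run this concurrently under READ
    # COMMITTED. Both SELECTs see the node as READY, and the second
    # save() silently overwrites the first, so both requests return
    # the same system_id. Node and NODE_STATUS stand in for the real
    # MAAS models.
    def racy_acquire(user):
        node = Node.objects.filter(status=NODE_STATUS.READY).first()
        if node is None:
            raise NoNodesAvailable()  # hypothetical error
        node.status = NODE_STATUS.ALLOCATED
        node.owner = user
        node.save()  # last writer wins
        return node.system_id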

Running with serialisable isolation would cause the last of these
commits to fail, so we'd eliminate the race. However, we'd want to
retry the failed operation.

PostgreSQL also offers repeatable read isolation. This may be a useful
middle-ground between read committed and serialisable, but I don't know
enough about it to comment more.
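
The difference is easy to demonstrate at the psycopg2 level; a sketch
assuming a scratch database with a "nodes" table, and psycopg2 >= 2.8
for the errors module:

    import psycopg2
    from psycopg2 import errors

    # Two connections standing in for the two API worker processes.
    a = psycopg2.connect("dbname=scratch")
    b = psycopg2.connect("dbname=scratch")
    for conn in (a, b):
        conn.set_session(isolation_level="REPEATABLE READ")

    ca, cb = a.cursor(), b.cursor()
    # Both transactions take their snapshots while the node is ready.
    ca.execute("SELECT status FROM nodes WHERE id = 1")
    cb.execute("SELECT status FROM nodes WHERE id = 1")

    ca.execute("UPDATE nodes SET status = 'allocated' WHERE id = 1")
    a.commit()

    try:
        # Fails with SQLSTATE 40001 because the row changed after this
        # transaction's snapshot was taken; under READ COMMITTED the
        # same UPDATE would simply win.
        cb.execute("UPDATE nodes SET status = 'allocated' WHERE id = 1")
        b.commit()
    except errors.SerializationFailure:
        b.rollback()  # the caller is expected to retry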

Django makes it simple enough to set the isolation level, but doesn't
provide retry support, as far as I can tell. It makes things somewhat
harder to do ourselves too, because it tries to normalise exceptions
coming from the database, semi-smothering the exception we'd want to
trigger a retry on.
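
Setting the level itself is simple enough in settings; the
isolation_level option has been accepted by the postgresql_psycopg2
backend since Django 1.6 (the database name here is illustrative):

    # settings.py (sketch)
    import psycopg2.extensions

    DATABASES = {
        'default': {
            'ENGINE': 'django.db.backends.postgresql_psycopg2',
            'NAME': 'maasdb',
            'OPTIONS': {
                'isolation_level':
                    psycopg2.extensions.ISOLATION_LEVEL_SERIALIZABLE,
            },
        },
    }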

Post 1.7 my thoughts are leaning towards greater isolation between
transactions, and more explicit transaction handling in our code:

- Increase the isolation level.

- Do not automatically wrap the request in a transaction.

- Forbid database access unless a transaction is explicitly begun.

- Create a context manager that runs the wrapped function within a
  transaction, but also detects serialisation errors and retries a
  finite number of times. Use this in preference to Django's atomic()
  decorator. (A sketch of this follows below.)
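
A minimal sketch of that last point, written as a decorator over
Django 1.6's atomic(); the name and the retry policy are illustrative,
not the eventual MAAS implementation:

    from functools import wraps
    from time import sleep

    from django.db import transaction
    from django.db.utils import OperationalError

    def retry_on_serialization_failure(func, tries=10, delay=0.01):
        """Run func in a transaction, retrying on SQLSTATE 40001."""
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, tries + 1):
                try:
                    with transaction.atomic():
                        return func(*args, **kwargs)
                except OperationalError as error:
                    # Django normalises database exceptions but keeps
                    # the original psycopg2 error as __cause__, so the
                    # serialisation failure is still recognisable.
                    cause = getattr(error, "__cause__", None)
                    if getattr(cause, "pgcode", None) != "40001":
                        raise
                    if attempt == tries:
                        raise
                    sleep(delay * attempt)  # simple linear backoff
        return wrapper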

Revision history for this message
Andres Rodriguez (andreserl) wrote :

Gavin,

Your ISOLATION changes to the DB are the ones that caused this issue. We had come across this issue before, and https://bugs.launchpad.net/juju-core/+bug/1314409 was the fix for it. This, to me, seems like a critical regression, and the changes need to be reverted.

Thanks

Revision history for this message
Andres Rodriguez (andreserl) wrote :

This also needs to be backported to 1.7

Changed in maas:
status: In Progress → Fix Committed
Revision history for this message
Christian Reis (kiko) wrote :

This is a redux of bug 1314409.

Just for completeness' sake: we discussed this at length this week, and we can't find where this has regressed; in fact, as far as we can tell, the isolation level has never changed. It was erroneously thought to have been changed in bug 1314409, but we don't think that change ever actually did anything.

If that's true, and this indeed has regressed from 1.5, it's not a factor of changing the isolation level, but something else in the lifecycle transitions that is making this prone to races. I'm not sure what that could be.

Revision history for this message
Christian Reis (kiko) wrote :

The real, deeper fix is to change the isolation level in the database, work which Gavin currently has underway: https://code.launchpad.net/~allenap/maas/transactions-redux-the-revenge

Revision history for this message
Julian Edwards (julian-edwards) wrote : Re: [Bug 1382575] Re: Same node handed out in response to two parallel 'acquire' requests

On Tuesday 21 Oct 2014 18:57:28 you wrote:
> If that's true, and this indeed has regressed from 1.5, it's not a
> factor of changing the isolation level, but something else in the
> lifecycle transitions that is making this prone to races. I'm not sure
> what that could be.

I rather suspect it's blind luck, or that the calling environment that
originally experienced the bug had some subtle change.

Changed in maas:
status: Fix Committed → Fix Released
Revision history for this message
Andres Rodriguez (andreserl) wrote :

This bug was filed against upstream MAAS, not Ubuntu. It was fixed as part of 1.7 and verified to be working on all Ubuntu releases. The 1.7 package is being SRU'd into Ubuntu. Marking this as verification-done, since it seems to be blocking the SRU.

tags: added: verification-done