TransactionRollbackError due to concurrent update could be better handled

Bug #992525 reported by Valentin Lab
This bug affects 34 people
Affects: Odoo Server (MOVED TO GITHUB)
Status: Fix Released
Importance: Low
Assigned to: OpenERP Publisher's Warranty Team
Milestone: 6.0.4

Bug Description

While using OpenERP, psycopg2 raises TransactionRollbackError quite often, even on a small database.

This does not seem to be easily reproducible, as it appears to be a conflict between two threads accessing the same table. Nevertheless, I provided a quick video reproducing it while installing "base_crypt" on my computer.

This occurs mostly at module installation, and it can completely mess up the module installation, for instance by leaving empty wizard windows.

I guess it could also occur in other situations (in a multi-user context), where the bug would be quite difficult to reproduce and would have unforeseeable consequences ;)

I've spotted another bug that seems to be caused by this: https://bugs.launchpad.net/bugs/956715

In my case (single user), it seems to hit more often on fast computers. To make a better guess: it seems to hit more often when using a local connection between the browser and the server. It could be that the web module trying to update the res_users session info collides with normal operations.

On my computer, starting from a new database, installing the 'base_crypt' module will trigger the exception.
When using a remote connection, the bug does not show up.

Please check the video I've posted with the bug report if you want more detail on the procedure I used. Sorry for the bad sound recording. Note that the video shows the bug occurring on my computer and NOT occurring on a remote computer.

I'm providing a merge proposal along with this report which solves the issue for me, but it needs a patient review.

Related branches

Jignesh Rathod(OpenERP) (jir-openerp) wrote :

Hello Valentin Lab,

Thanks for your contribution.
This is only reproducible when you start the web client in embedded mode.

Thank you!

Changed in openobject-server:
importance: Undecided → Low
status: New → Confirmed
assignee: nobody → OpenERP's Framework R&D (openerp-dev-framework)
summary: - TransactionRollbackError are occuring randomly
+ TransactionRollbackError due to concurrent update could be better
+ handled
Olivier Dony (Odoo) (odo-openerp) wrote :

Here is some extra information concerning this bug report:

As discussed on the framework mailing-list[1] and on bug 746620, a TransactionRollbackError may happen under normal circumstances when a certain series of concurrent updates happen in an order that is not safe for the proper serialization of transactions. In other words, it is a fundamental safety mechanism that prevents corrupting data in case of concurrent updates to the same information.
As an example, think of a simplistic bank management system where 2 transactions run at the same time, both read "current balance: $100", and both withdraw $100 by saving "current balance: $0" - an obvious mistake, as the final result should be -$100.
This cannot happen in OpenERP as accounting is using a double-entry system, but similar conflicts may happen for other kinds of data.

It is an "optimistic locking" strategy, which means that it will not make the operations block all the time when another operation is using the same resources (which would bring poor performance), but when a conflict is actually detected it will only allow the first transaction to complete, and rollback the orther, as if they had never occurred.
The user can simply retry the failed operation and it will work perfectly fine.

There is one way we could improve this system in OpenERP: we could automatically retry failed transactions after waiting a few milliseconds, because the chances are very high that on the second attempt the transaction will succeed (the previously conflicting transaction having completed already). This means the user would not even see the failed transaction most of the time.

There is already an attempt to implement this (see the merge proposal linked to this bug), but the implementation is not finished/properly working yet.

[1] https://lists.launchpad.net/openerp-expert-framework/msg00818.html
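
To make the idea concrete, here is a minimal, illustrative sketch (not the actual OpenERP patch) of such an auto-retry wrapper around a unit of work, assuming a plain psycopg2 connection and a hypothetical do_work(cursor) callable:

    import random
    import time

    from psycopg2.extensions import TransactionRollbackError

    def run_with_retry(connection, do_work, max_retries=3):
        """Run do_work(cursor) and replay it if the transaction is rolled
        back because of a concurrent update (serialization failure)."""
        for attempt in range(max_retries + 1):
            cursor = connection.cursor()
            try:
                result = do_work(cursor)
                connection.commit()
                return result
            except TransactionRollbackError:
                # The whole transaction was rolled back; wait a bit, then replay it.
                connection.rollback()
                if attempt == max_retries:
                    raise
                time.sleep(random.uniform(0.1, 0.5) * (attempt + 1))
            finally:
                cursor.close()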

Kyle Waid (midwest) wrote :

Hello,

I am having serious problems with this type of error on 6 different systems. Most of the time it occurs when installing a new module, but other times it happens when running scheduled processes, or just when running concurrent processes in general. I reported a similar bug before, but it was marked as invalid. I find this problem exists throughout the program; it is a serious problem in this software release that has existed for a few months but has not been fixed. I was going to migrate a customer to 6.1, but now I am strongly reconsidering due to this issue. I see stability issues within the software that make it unusable in a production environment with many users. This problem, as you know, does not exist in previous releases.

Kyle Waid (midwest) wrote :

Also, this is a core system problem and I would have reconsidered releasing the version to the public with such an issue.

Olivier Dony (Odoo) (odo-openerp) wrote :

This error is only supposed to happen when two transactions read *then* write data at the same time in the database, in a manner that may have produced different results if they were executed one after the other instead.
That should not happen frequently even when many users are working on the system at the same time, unless the users are all modifying the same data exactly at the same time.

Now, there are a few things that might cause this issue to happen too frequently, such as:
- two persons sharing the same login
- PostgreSQL 9.1 used in combination with an old psycopg2 Python library version (older than 2.4.3)

You can find out the psycopg2 version with this command on the OpenERP server machine:
      $ python -c 'import psycopg2;print psycopg2.__version__'
       2.4.3 (dt dec mx ext pq3)

Older psycopg2 versions did not fully support the transaction isolation mechanisms of PostgreSQL 9.1, and might cause this low-level protection to trigger too often.

In other cases, this error is triggered by a protection mechanism that ensures proper data integrity and is essential for the safety of the transactions. It likely did not happen as often with OpenERP 6.0; the reason is that the protection mechanism used in OpenERP 6.0 was not as effective, and may have missed certain kinds of dangerous concurrent updates.
OpenERP 6.1 is fully using the Snapshot Serialization Isolation mechanism of PostgreSQL, an industry-standard solution for maintaining transaction integrity [1].

As discussed, we can further improve the user experience by adding an auto-retry mechanism in the transaction handling code, but the frequency of such errors is supposed to be quite low under normal circumstances.

[1] http://en.wikipedia.org/wiki/Snapshot_isolation
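
For anyone who wants to see this protection in action outside of OpenERP, the following small sketch provokes the same "could not serialize access due to concurrent update" error (SQLSTATE 40001) with two REPEATABLE READ connections updating the same row. The database name and the throwaway table are assumptions for the example:

    import psycopg2
    from psycopg2.extensions import (ISOLATION_LEVEL_REPEATABLE_READ,
                                     TransactionRollbackError)

    DSN = "dbname=test"  # assumption: point this at a database you can write to

    # Create a throwaway row to fight over.
    setup = psycopg2.connect(DSN)
    cur = setup.cursor()
    cur.execute("DROP TABLE IF EXISTS balance")
    cur.execute("CREATE TABLE balance (id int PRIMARY KEY, amount int)")
    cur.execute("INSERT INTO balance VALUES (1, 100)")
    setup.commit()

    conn_a = psycopg2.connect(DSN)
    conn_b = psycopg2.connect(DSN)
    conn_a.set_isolation_level(ISOLATION_LEVEL_REPEATABLE_READ)
    conn_b.set_isolation_level(ISOLATION_LEVEL_REPEATABLE_READ)
    cur_a, cur_b = conn_a.cursor(), conn_b.cursor()

    # Both transactions take their snapshot of the same row...
    cur_a.execute("SELECT amount FROM balance WHERE id = 1")
    cur_b.execute("SELECT amount FROM balance WHERE id = 1")
    # ...the first one updates the row and commits...
    cur_a.execute("UPDATE balance SET amount = amount - 100 WHERE id = 1")
    conn_a.commit()
    # ...so the second update is rejected instead of silently losing the first one.
    try:
        cur_b.execute("UPDATE balance SET amount = amount - 100 WHERE id = 1")
    except TransactionRollbackError as e:
        print("second transaction rolled back: %s" % e)
        conn_b.rollback()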

Fabian Semal (fabian.semal) wrote :

Hello,

Just encountered the problem after upgrading an old 6.1 database to the latest revision.

If it helps, here is how to solve it:

On Ubuntu Lucid :

apt-get remove --purge python-psycopg2
wget http://www.psycopg.org/psycopg/tarballs/PSYCOPG-2-4/psycopg2-2.4.5.tar.gz
tar xvzf psycopg2-2.4.5.tar.gz
cd psycopg2-2.4.5
python setup.py install

On Ubuntu Oneiric :

sudo pip install --upgrade psycopg2

Regards,

Fabian

Marcos Mendez (marcos-hr7) wrote :

Hi,

I've reproduced this concurrency problem easily. Please look at https://bugs.launchpad.net/openobject-server/+bug/1013223.

My setup is:

ubuntu 12.04 lts (all updates)
python 2.7.3
postgresql 9.1
openerp 6.1
psycopg2 2.4.5

Basically, with ten users trying to create a partner at the same time, requests will fail - not all of them will succeed. I have provided the instructions on how to reproduce the error with JMeter. There are many errors in the log.

Please revisit this. I would hope that OpenERP can handle more than 5-8 concurrent users doing reads and writes.

Olivier Dony (Odoo) (odo-openerp) wrote : Re: [Bug 992525] Re: TransactionRollbackError due to concurrent update could be better handled

On 06/18/2012 03:20 PM, Marcos Mendez wrote:
> Basically, ten users trying to create a partner at the same time will
> fail - not all will work. I have provided the instructions on how to
> reproduce the error with JMeter. There are many errors in the log.
>
> Please revisit this. I would hope that OpenERP can handle more than 5-8
> concurrent users doing reads and writes.

It seems your test is heavily biased. Unless I missed something, you are:
- simulating concurrent users that are all using the same login
- making the virtual users login before each request

The login call that is performed before each query is not only unnecessary but
will also cause an update to the user record in the database (to change the
last login time). And because all users are sharing the same login, this can
cause concurrent transactions to be invalidated due to potential conflicts on
the shared user record.

Please take these two factors out of the equation and see if you can reproduce
the problem.

It is valid to perform batch transactions using a unique login, e.g. for web
services integration, but if you do that in production you'll want to pay
attention to the following:
- No need to call login() every time, it's only used to get the user id (and
update the last login time, which you don't care about in this context)
- You *have to* implement a transaction queue, and implement appropriate error
handling! Things could go wrong anywhere in the flow: network, hardware, disk
space, database server, openerp server -> you *must* handle the various cases
appropriately, or you're asking for trouble in the future. In most cases the
solution will be to requeue the request, and this will work as well for the
error we're discussing here, if it ever happens.

PS: bug 1013223 does seem to be a duplicate of this bug
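
As a rough illustration of that advice (log in only once, then retry/requeue failed calls), a client-side sketch could look like the following. The host, database, credentials and the simple in-process retry loop are assumptions for the example, not part of OpenERP itself:

    import time
    import xmlrpclib  # Python 2, as used elsewhere in this report

    HOST, DB, LOGIN, PWD = 'localhost', 'mydb', 'admin', 'admin'  # assumptions

    # Log in once: we only need the user id, not a fresh "last login" update per call.
    common = xmlrpclib.ServerProxy('http://%s:8069/xmlrpc/common' % HOST)
    uid = common.login(DB, LOGIN, PWD)
    models = xmlrpclib.ServerProxy('http://%s:8069/xmlrpc/object' % HOST)

    def execute_with_requeue(model, method, args, max_retries=3):
        """Send one RPC call, requeueing it a few times if anything goes wrong
        (network error, server fault, concurrency rollback, ...)."""
        for attempt in range(max_retries + 1):
            try:
                return models.execute(DB, uid, PWD, model, method, *args)
            except (xmlrpclib.Fault, IOError):
                if attempt == max_retries:
                    raise  # give up and let the caller decide what to do
                time.sleep(0.5 * (attempt + 1))

    # Example: create partners in a batch without logging in again for each one.
    for name in ['Partner A', 'Partner B']:
        execute_with_requeue('res.partner', 'create', [{'name': name}])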

Marcos Mendez (marcos-hr7) wrote :

Agreed. I will modify the test to remove that bias and send updates. We will also have automated systems, so this test was trying to simulate that as well. We can do the login once. I just expect OpenERP to handle a normal load.

Thanks


Marcos Mendez (marcos-hr7) wrote : Re: [Bug 992525] TransactionRollbackError due to concurrent update could be better handled

Hi,

Is there some code/example showing how to make a user a manager of a group? I'm trying to automate the creation of users to do this test properly. I can create the user, but just assigning groups_id is not enough when creating the user.

For example:
# CREATE USER
data = {
    #'groups_id': [9, 2, 8, 19, 23, 21, 13, 11, 17, 7, 12, 18, 20, 1, 22, 3],
    'password': 'test',
    'active': True,
    'name': 'Test',
    'login': 'test',
}

sock = xmlrpclib.ServerProxy('http://%s:8069/xmlrpc/object' % host)
id = sock.execute(dbname, uid, pwd, model, 'create', data)
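
For what it's worth, many2many fields such as groups_id are written through special command tuples in the OpenERP ORM, so to put the new user into specific groups the value usually needs to be wrapped in a (6, 0, ids) command rather than passed as a plain list of ids. A hedged sketch, reusing the variables from the snippet above (the group-lookup criteria are hypothetical):

    # Assumption: sock, dbname, uid and pwd come from the snippet above.
    # To make the user a manager of an application, add him to that
    # application's "Manager" group; the ids can be looked up by name:
    manager_ids = sock.execute(dbname, uid, pwd, 'res.groups', 'search',
                               [('name', 'ilike', 'Manager')])  # hypothetical criteria

    # (6, 0, ids) means "replace the user's groups with exactly these ids".
    data = {
        'name': 'Test',
        'login': 'test',
        'password': 'test',
        'active': True,
        'groups_id': [(6, 0, manager_ids)],
    }
    new_user_id = sock.execute(dbname, uid, pwd, 'res.users', 'create', data)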


Kyle Waid (midwest) wrote :

Hello, I receive the error in many different ways. One way I was able to reproduce it:

Three different users (unique logins) try to confirm production of a bill of materials. Granted, it was the same one; two users received the error while the third succeeded.

Another way I receive the error frequently is when installing new modules. I have been able to reproduce the problem on 10 systems with current code easily, but other OpenERP core developers say they can't reproduce it. The problem seems worse when installing a module from the kanban view. The issue, besides the error message, is not being able to run the configuration wizards.

Another way I was able to reproduce it is by trying to import orders using the Magento connection. I see the error a lot this way. Olivier, you explained a possible relief; I will see about having this developed in a module or something.

Changed in openobject-server:
assignee: OpenERP's Framework R&D (openerp-dev-framework) → OpenERP Publisher's Warranty Team (openerp-opw)
Olivier Dony (Odoo) (odo-openerp) wrote :

We've just written a patch to implement this auto-retry logic in all affected branches: 6.0, 6.1, 7.0 (and trunk).
It should be fairly safe and without many side-effects: it will try to replay RPC calls that result in a transaction rollback caused by one of these 3 PostgreSQL error codes[1]:

 - SERIALIZATION_FAILURE (40001 - "cannot serialize transactions due to concurrent update")
 - DEADLOCK_DETECTED (40P01)
 - LOCK_NOT_AVAILABLE (55P03 - "could not obtain lock on row in relation ...")

Each of these errors is transient and caused by the presence of another concurrent transaction working on the same database entries. The likelihood of seeing that other transaction committed increases with every passing millisecond, so in most cases it should be sufficient to retry once after a little while.
After testing this patch with several clients hammering the server at the same time, we noticed that 3-4 retries with a randomized delay of several hundred milliseconds seem to allow them all to pass, whereas if we retry only once we still get a few failures when there are more than 2 concurrent transactions doing the same thing.

Concerning the side-effects, the failed transactions have just been rolled back, so replaying them is correct on a semantic level. In rare cases the rolled back transaction might have had a side effect on the rest of the world (e.g. sent an email or written a file), so replaying it might cause the side-effect to occur a second time. However this would be true even with manual replay instead of automatic replay - the user could simply press the same button again to retry. Basically we're just assuming the user did mean the transaction to happen so we're pressing the button again for her.

We've thought of making the retry delay and/or count configurable, but the defaults should be fine for most cases. And if the default values are not good enough, a proper analysis of the concurrency issue would probably be better than bumping up the settings without understanding them. With the default settings the auto-retry could delay the transaction by up to several dozen seconds, which already seems like a very large limit. Most auto-retried transactions will not be delayed by more than a few hundred milliseconds though.

Any feedback/tests for these sensitive patches would be appreciated. We're planning to merge them soon unless a problem is detected.

Thanks!

[1] see http://www.postgresql.org/docs/current/static/errcodes-appendix.html#ERRCODES-TABLE and http://initd.org/psycopg/docs/errorcodes.html
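
For reference, all three error codes are exposed by psycopg2, so a transient-failure check along the lines of the committed patch could be written as follows (an illustrative sketch, not the actual server code):

    from psycopg2 import OperationalError, errorcodes

    # Error codes treated as transient concurrency failures that are safe to replay.
    CONCURRENCY_ERRORS = (
        errorcodes.SERIALIZATION_FAILURE,  # 40001
        errorcodes.DEADLOCK_DETECTED,      # 40P01
        errorcodes.LOCK_NOT_AVAILABLE,     # 55P03
    )

    def is_transient_concurrency_error(exc):
        """Return True if the exception is a rollback caused by a concurrent
        transaction, i.e. one that can simply be replayed after a short delay."""
        return isinstance(exc, OperationalError) and exc.pgcode in CONCURRENCY_ERRORS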

Changed in openobject-server:
status: Confirmed → Fix Committed
Changed in openobject-server:
status: Fix Committed → Fix Released
Guewen Baconnier @ Camptocamp (gbaconnier-c2c) wrote :

The branches are not merged; the status of the bug should not be 'Fix Released'.

Can someone put the status back to 'Fix Committed'?

Changed in openobject-server:
status: Fix Released → Fix Committed
Olivier Dony (Odoo) (odo-openerp) wrote :

A correction (described in comment #13) has finally landed in the following branches:

- In OpenERP 6.0, as of openobject-server rev. 3655 rev-id <email address hidden>
- In OpenERP 6.1, as of openobject-server rev. 4338 rev-id <email address hidden>
- In OpenERP 7.0, as of openobject-server rev. 4850 rev-id <email address hidden>

Thanks to anyone who contributed to the resolution, either by investigating, proposing patches or reviewing them!

Changed in openobject-server:
milestone: none → 6.0.4
status: Fix Committed → Fix Released
Kyle Waid (midwest) wrote :

Hi,

I know this is a really old post. I was looking at openerp/sql_db.py in v7 and it looks like OpenERP uses "Repeatable Read" and not SERIALIZABLE isolation.

Further,

from psycopg2.extensions import ISOLATION_LEVEL_AUTOCOMMIT, ISOLATION_LEVEL_READ_COMMITTED, ISOLATION_LEVEL_REPEATABLE_READ

            isolation_level = ISOLATION_LEVEL_REPEATABLE_READ \
                                  if self._serialized \
                                  else ISOLATION_LEVEL_READ_COMMITTED

I do not see anywhere in the code where psycopg2's SERIALIZABLE isolation level is imported or used.

Thanks!
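
As a small illustration of what that snippet does (assuming a plain psycopg2 connection rather than OpenERP's Cursor wrapper), the selected constant is simply applied to the connection, and on PostgreSQL 9.1+ REPEATABLE READ is the level that provides snapshot isolation:

    import psycopg2
    from psycopg2.extensions import ISOLATION_LEVEL_REPEATABLE_READ

    conn = psycopg2.connect("dbname=test")  # assumption: any reachable database
    # Roughly what sql_db.py does when self._serialized is true: every
    # transaction on this connection now runs under REPEATABLE READ, which
    # PostgreSQL implements as snapshot isolation.
    conn.set_isolation_level(ISOLATION_LEVEL_REPEATABLE_READ)

    cur = conn.cursor()
    cur.execute("SHOW transaction_isolation")
    print(cur.fetchone()[0])  # -> 'repeatable read'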

Kyle Waid (midwest) wrote :

I answered my own question:

        As a result of the above, we have selected ``REPEATABLE READ`` as
        the default transaction isolation level for OpenERP cursors, as
        it will be mapped to the desired ``snapshot isolation`` level for
        all supported PostgreSQL version (8.3 - 9.x).

However, the discussion above seems misleading in that it suggests OpenERP uses SERIALIZABLE transaction isolation when in fact it does not.
