Optimization: Don't rebuild revoke-tree in each validate-token call

Bug #1503312 reported by Sonali
This bug affects 5 people
Affects: OpenStack Identity (keystone)
Status: Invalid
Importance: Medium
Assigned to: Unassigned
Milestone: (none)

Bug Description

To check whether a token has been revoked, keystone builds a tree from all the entries in the revocation_event table and searches for a matching token by traversing that tree. Every validate-token call rebuilds the tree: this includes fetching all entries from the db, converting them into dict objects for manipulation, and then building the tree. We observed that the conversion of db entries to objects is what causes the increase in latency, not the building of the tree or the tree traversal. This conversion and the subsequent rebuilds are unnecessary and can be avoided. With this patch, the first call to the get_revoke_tree method builds the tree, and subsequent calls do not rebuild it unless the tree has been invalidated by a new addition to the revocation_event table.
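
The proposed behaviour amounts to caching the built tree and invalidating it whenever a new revocation event is recorded. A minimal sketch in Python (illustrative only; the class and driver method names are assumptions, not the actual keystone code):

class RevokeTree(object):
    """Stand-in for keystone's revocation tree; indexes revocation events."""
    def __init__(self, revoke_events):
        self.events = [dict(e) for e in revoke_events]

    def is_revoked(self, token_values):
        # Placeholder match; the real tree compares token attributes level by level.
        return any(all(token_values.get(k) == v for k, v in e.items())
                   for e in self.events)


class CachedRevokeAPI(object):
    """Builds the tree once and reuses it until a new event invalidates it."""
    def __init__(self, driver):
        self.driver = driver
        self._tree = None  # rebuilt lazily, only after invalidation

    def revoke(self, event):
        self.driver.revoke(event)  # persist to the revocation_event table
        self._tree = None          # invalidate: the next validation rebuilds

    def get_revoke_tree(self):
        if self._tree is None:
            # The costly step per the report: fetching all rows and converting
            # them to event objects before building the tree.
            self._tree = RevokeTree(self.driver.list_events())
        return self._tree

    def check_token(self, token_values):
        return self.get_revoke_tree().is_revoked(token_values)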

Sonali (sonali-pitre)
Changed in keystone:
assignee: nobody → Sonali (sonali-pitre)
Revision history for this message
Anithra P Janakiraman (janithra) wrote :

Related bugs: 1287757 & 1456797. The fix for 1287757 addresses the issue partially by removing the purge on validate, but we still see deadlocks and a huge increase in latency as the number of revocation table entries grows, due to the time taken to convert db entries to objects, as mentioned in the bug description.

Revision history for this message
Dolph Mathews (dolph) wrote :

I'm tagging this with Fernet only because of the potential performance improvement here for Fernet-based deployments, which are heavily dependent on the performance of token validation and thus revocation events.

Changed in keystone:
importance: Undecided → Medium
tags: added: fernet
Changed in keystone:
status: New → Triaged
Revision history for this message
Clint Byrum (clint-fewbar) wrote :

I don't see a patch?

Revision history for this message
Dolph Mathews (dolph) wrote :

I was hoping to find one in gerrit that wasn't linked to the bug, but I don't see one yet either.

Revision history for this message
Anithra P Janakiraman (janithra) wrote :

Hi Dolph, Clint,
We haven't submitted the patch yet; we're having some issues setting up the environment to run the unit tests. Will upload shortly. Sorry!

Revision history for this message
Dolph Mathews (dolph) wrote :

While I appreciate that you want to run the unit tests offline, I'm also anxious to see this patch and would suggest that you upload it anyway, mark it as Work In Progress, and let OpenStack-infra run the tests in the meantime :)

Revision history for this message
Dolph Mathews (dolph) wrote :

This is closely related to bug 1471665 in that a solution to this bug will provide some relief there (although we still likely need a sledgehammer-type solution to actually close that one).

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to keystone (stable/kilo)

Fix proposed to branch: stable/kilo
Review: https://review.openstack.org/232566

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to keystone (master)

Fix proposed to branch: master
Review: https://review.openstack.org/232715

Changed in keystone:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on keystone (stable/kilo)

Change abandoned by Anithra P Janakiraman (<email address hidden>) on branch: stable/kilo
Review: https://review.openstack.org/232566
Reason: Abandoning the patch for the kilo branch; as per Steve's suggestion, will backport the patch from master.

Revision history for this message
Adam Young (ayoung) wrote :

Before we make any changes here, can someone please verify that this is, in fact, a bug? There is a MEMOIZE call on the tree-building function that should prevent the tree from being built every time.
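
For reference, the MEMOIZE Adam mentions is dogpile.cache-style memoization. Roughly (an illustrative sketch with an in-memory backend and stand-in helpers, not keystone's actual wiring via its configured cache backend):

from dogpile.cache import make_region

region = make_region().configure('dogpile.cache.memory', expiration_time=1200)
MEMOIZE = region.cache_on_arguments()

_EVENTS = []  # stand-in for the revocation_event table

@MEMOIZE
def get_revoke_tree():
    # With caching enabled, this body only runs on a cache miss, so the tree
    # is not rebuilt on every validate-token call on this node.
    return list(_EVENTS)  # stand-in for the real tree build

def revoke(event):
    _EVENTS.append(event)
    # Unless this invalidation reaches every node's cache, other nodes keep
    # serving a stale tree until expiration_time elapses.
    get_revoke_tree.invalidate()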

Revision history for this message
Dolph Mathews (dolph) wrote :

As ayoung suggested, do you have caching enabled?

Changed in keystone:
status: In Progress → Incomplete
Revision history for this message
Anithra P Janakiraman (janithra) wrote :

Hi dolph, ayoung, we tried after enabling caching and it does work fine on a single-node keystone setup. However, on a multi-node setup the revocation tree is rebuilt only on the node which handles the revoke request. All other trees are stale until each node receives a revoke request, so there is a possibility of a revoked token being accepted.

Revision history for this message
Dolph Mathews (dolph) wrote :

Does each keystone node have its own cache, or are you sharing a single memcache instance (for example) with all nodes?

Revision history for this message
Anithra P Janakiraman (janithra) wrote :

We use a set of memcache servers. This is our config:
[cache]
backend = keystone.cache.memcache_pool
enabled = true
memcache_servers = 10.x.y.1:11211, 10.x.y.2:11211, 10.x.y.3:11211
expiration_time=1200

Revision history for this message
Anithra P Janakiraman (janithra) wrote :

The order of entries in the list is the same on all nodes (I remember seeing another bug that talked about issues when the order is changed, so I checked this).

Revision history for this message
Sonali (sonali-pitre) wrote :

Marking the bug as In Progress.
Please let us know if any more information is needed.

Changed in keystone:
status: Incomplete → In Progress
Revision history for this message
Haneef Ali (haneef) wrote :

Just a note: though in theory UUID tokens shouldn't need revoke_events, this also affects UUID tokens.

Revision history for this message
Anithra P Janakiraman (janithra) wrote :

ayoung, dolph, please let us know if you agree that this is an issue.

Revision history for this message
Morgan Fainberg (mdrnstm) wrote :

What version of keystone is this? We have some caching in some versions but there were issues.

Also, are you running a master-master-master cluster with Galera and writing to all of them? In my experience it is important to only write to a single node in Galera unless you've heavily tested multi-master writing (sadly, we have not).

It is unlikely that the caching itself is the issue here; more likely you are either invalidating the cache (by new events being issued), hitting a subtle difference in the event query on different keystones (resulting in a cache miss), or something similar.

Since you are complaining of deadlocks, we can pretty much eliminate caching as the problem.

This bug should also not be "in-progress" if there has not been a patch posted.

Changed in keystone:
status: In Progress → Incomplete
Revision history for this message
Morgan Fainberg (mdrnstm) wrote :

If this only affects kilo then this should be targeted specifically to kilo not master.

Revision history for this message
Morgan Fainberg (mdrnstm) wrote :

Sorry, I misread the abandonment of the patch; I thought master was abandoned.

Changed in keystone:
status: Incomplete → In Progress
tags: added: revoke
Revision history for this message
Anithra P Janakiraman (janithra) wrote :

Thanks for the comments, Morgan Fainberg; we will retest this once more and post exactly what we did and the findings. We do use a multi-node setup. We don't see deadlocks when caching is enabled. The only issue with caching that we had was that the revoke tree was invalidated on one node but not on another node for a short period of time, causing a stale tree to be used.
Will retest and post the results in a couple of days.

Revision history for this message
Anithra P Janakiraman (janithra) wrote :

We still see a problem with memcache when using more than one keystone node/memcache server. There is no issue with the db.
Configuration:
[cache]
backend = keystone.cache.memcache_pool
enabled = true
memcache_servers = 10.x.y.1:11211, 10.x.y.2:11211, 10.x.y.3:11211
expiration_time=1200
When we enable caching, we see that when a token is revoked on one node, a subsequent check_token request on that node results in the tree being rebuilt on that node, but the other keystone nodes allow some tokens to be checked before the tree is rebuilt. This seems like a sync issue. Pasting some output from the logs on two nodes (these are custom logs added to check when an entry is revoked / the tree is rebuilt). On Node 2, one token has been checked without the tree being rebuilt, but a subsequent request causes the tree to be rebuilt. Both nodes have the same time.

Node 1:

2016-03-21 10:38:27.193 28085 INFO keystone.common.wsgi [req-2891e182-c25f-4c4a-8bc1-0d3fc3532c2f - - - - -] DELETE https://keystone.service.os:35357/v2.0/users/10f0105c591f4d92816165146d3e9664
2016-03-21 10:38:27.205 28086 WARNING keystone.contrib.revoke.core [req-9c15cfbf-831a-4be2-a7e9-cedc69e04ceb - - - - -] CUSTOM: in revoke
2016-03-21 10:38:27.208 28086 WARNING keystone.contrib.revoke.core [req-39b18eb1-291f-46a0-a807-8fdb3b7ba86d - - - - -] CUSTOM: in revoke
2016-03-21 10:38:27.223 28086 INFO keystone.common.kvs.core [req-9c15cfbf-831a-4be2-a7e9-cedc69e04ceb - - - - -] Using default dogpile sha1_mangle_key as KVS region token-driver key_mangler

Node 2::

2016-03-21 10:38:27.206 6478 WARNING keystone.contrib.revoke.core [req-b215ab41-b0d8-433e-a971-a34b45a7cf72 - - - - -] CUSTOM : in check token
2016-03-21 10:38:27.206 6478 WARNING keystone.contrib.revoke.core [req-b215ab41-b0d8-433e-a971-a34b45a7cf72 - - - - -] {'access_token_id': None, 'project_id': u'bdf5fb89fcaa4faa898d3157b0e6785b', 'user_id': u'de3cebf70675407c894e4574b963a1ac', 'roles': [u'9fe2ff9ee4384b1894a90878d3e92bab', u'1d57a2ba441c4bccb25a85ab0a11ce95'], 'audit_id': 'ovDmHMhgRtm70cmuS6ASIw', 'trustee_id': None, 'trustor_id': None, 'expires_at': datetime.datetime(2016, 3, 21, 11, 38, 26), 'consumer_id': None, 'assignment_domain_id': u'default', 'issued_at': datetime.datetime(2016, 3, 21, 10, 38, 27), 'identity_domain_id': u'default', 'audit_chain_id': 'ovDmHMhgRtm70cmuS6ASIw', 'trust_id': None}
2016-03-21 10:38:27.210 6478 INFO keystone.common.wsgi [req-b215ab41-b0d8-433e-a971-a34b45a7cf72 - - - - -] DELETE https://keystone.service.os:35357/v2.0/users/2ee5ceda37ba42e5b540fece70203451
2016-03-21 10:38:27.237 6479 WARNING keystone.contrib.revoke.core [req-18928f40-1653-4b24-80e0-bea065ce0dfa - - - - -] CUSTOM: in check token
2016-03-21 10:38:27.237 6479 WARNING keystone.contrib.revoke.core [req-18928f40-1653-4b24-80e0-bea065ce0dfa - - - - -] {'access_token_id': None, 'project_id': u'bdf5fb89fcaa4faa898d3157b0e6785b', 'user_id': u'de3cebf70675407c894e4574b963a1ac', 'roles': [u'9fe2ff9ee4384b1894a90878d3e92bab', u'1d57a2ba441c4bccb25a85ab0a11ce95'], 'audit_id': 'aCJFgymrQa6yjKv4CuIQgw', 'trustee_id': None, 'trustor_id': None, 'expires_at': datetime.datetime(2016, 3, 21, 11, 38, 26), 'consumer_id': None, 'assignmen...


Changed in keystone:
assignee: Sonali (sonali-pitre) → nobody
Navid Pustchi (npustchi)
Changed in keystone:
assignee: nobody → Navid Pustchi (npustchi)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on keystone (master)

Change abandoned by ayoung (<email address hidden>) on branch: master
Review: https://review.openstack.org/232715
Reason: Superseded by work to replace the tree with a list. Should solve this issue, too.

Revision history for this message
Lance Bragstad (lbragstad) wrote :

This should be addressed (at least partially) with https://review.openstack.org/#/c/311652/ ?

Revision history for this message
Steve Martinelli (stevemar) wrote :

We replaced the tree with a linear search: https://review.openstack.org/#/c/311652/ - closing this one as Invalid now. Don't expect the patch to be backported to Mitaka or Liberty; it's too complex to introduce into our stable branches.
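
For context, the linear-search approach in that review boils down to checking each revocation event against the token's attributes instead of building and traversing a tree. A rough sketch (illustrative only, with an assumed subset of event attributes, not the merged keystone code):

_EVENT_ATTRS = ('user_id', 'project_id', 'trust_id', 'audit_id')  # assumed subset

def is_revoked(events, token_values):
    """Return True if any revocation event matches the token.

    An event matches when every attribute it actually specifies equals the
    corresponding token value; issued-at/expiry handling is omitted here.
    """
    for event in events:
        specified = {k: v for k, v in event.items()
                     if k in _EVENT_ATTRS and v is not None}
        if specified and all(token_values.get(k) == v
                             for k, v in specified.items()):
            return True
    return False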

Changed in keystone:
status: In Progress → Invalid
assignee: Navid Pustchi (npustchi) → nobody