Merge lp:~thedac/charms/trusty/rabbitmq-server/native-cluster-race-fixes into lp:~openstack-charmers-archive/charms/trusty/rabbitmq-server/next

Proposed by David Ames
Status: Merged
Merged at revision: 111
Proposed branch: lp:~thedac/charms/trusty/rabbitmq-server/native-cluster-race-fixes
Merge into: lp:~openstack-charmers-archive/charms/trusty/rabbitmq-server/next
Diff against target: 107 lines (+36/-11)
2 files modified
hooks/rabbit_utils.py (+6/-1)
hooks/rabbitmq_server_relations.py (+30/-10)
To merge this branch: bzr merge lp:~thedac/charms/trusty/rabbitmq-server/native-cluster-race-fixes
Reviewer Review Type Date Requested Status
Liam Young (community) Approve
Billy Olsen Needs Resubmitting
Ryan Beisner (community) Needs Resubmitting
Review via email: mp+269395@code.launchpad.net

Description of the change

Fix clustering race conditions for LP Bug#1486177

For juju >= 1.24 run cluster_changed() when the leader-settings-changed hook executes.
For juju < 1.24 use remote_unit() in relation_get when in a native cluster relation hook

See also https://code.launchpad.net/~thedac/charm-helpers/legacy_leadership_peer_retrieve_fix/+merge/269400

To post a comment you must log in.
Revision history for this message
uosci-testing-bot (uosci-testing-bot) wrote :

charm_unit_test #8155 rabbitmq-server-next for thedac mp269395
    UNIT OK: passed

Build: http://10.245.162.77:8080/job/charm_unit_test/8155/

Revision history for this message
uosci-testing-bot (uosci-testing-bot) wrote :

charm_lint_check #8827 rabbitmq-server-next for thedac mp269395
    LINT OK: passed

Build: http://10.245.162.77:8080/job/charm_lint_check/8827/

Revision history for this message
Ryan Beisner (1chb1n) wrote :

Take note that the amulet test results on this MP are irrelevant to this proposal, as they are the old tests which do not accurately check that each node is a cluster member; They only check that the `rabbitmqctl cluster_status` command exits 0, which will be True even for a 2 or 1-node cluster.

I've re-sync'd these proposed changes into my amulet test refactor branch, and will confirm consistent clustering behavior and report back here.

Thank you for your work on this.

Revision history for this message
Ryan Beisner (1chb1n) wrote :

Just noting that this MP is dependent on https://code.launchpad.net/~thedac/charm-helpers/legacy_leadership_peer_retrieve_fix/+merge/269400 landing, pending validation of functional tests of course.

Revision history for this message
uosci-testing-bot (uosci-testing-bot) wrote :

charm_amulet_test #6061 rabbitmq-server-next for thedac mp269395
    AMULET OK: passed

Build: http://10.245.162.77:8080/job/charm_amulet_test/6061/

Revision history for this message
Billy Olsen (billy-olsen) wrote :

Couple of minor comments below. Would be nice to see some unit tests around this too please.

review: Needs Fixing
Revision history for this message
uosci-testing-bot (uosci-testing-bot) wrote :

charm_unit_test #8407 rabbitmq-server-next for thedac mp269395
    UNIT OK: passed

Build: http://10.245.162.77:8080/job/charm_unit_test/8407/

Revision history for this message
uosci-testing-bot (uosci-testing-bot) wrote :

charm_lint_check #9098 rabbitmq-server-next for thedac mp269395
    LINT OK: passed

Build: http://10.245.162.77:8080/job/charm_lint_check/9098/

Revision history for this message
Ryan Beisner (1chb1n) wrote :

Bah. This now conflicts after next rev 107 (https://code.launchpad.net/~niedbalski/charms/trusty/rabbitmq-server/fix-lp-1489053/+merge/269261).

Can you rebase and resolve conflicts in this MP? I can then pull it back into the test branch, which I'm actively working on atm. Much appreciated.

Revision history for this message
Billy Olsen (billy-olsen) wrote :

Thanks for making the changes requested David! I've gone ahead and resolved the merge conflict in my merge (since I contributed to causing the merge conflict). Ryan has given enough testing through his own OSCI runs and such for me to give this one a thumbs up and push it through now.

review: Approve
Revision history for this message
Ryan Beisner (1chb1n) wrote :

Actually, I didn't get to test r107 + this, which definitely is not working: http://paste.ubuntu.com/12242485/

I recommend reverting rmq/next r108 and r107 (back to 106) so we can get this CRIT finished, then land the new tests, and only then re-consider other proposals to rmq when we can reliably test them.

review: Needs Resubmitting
Revision history for this message
uosci-testing-bot (uosci-testing-bot) wrote :

charm_lint_check #9165 rabbitmq-server-next for thedac mp269395
    LINT OK: passed

Build: http://10.245.162.77:8080/job/charm_lint_check/9165/

Revision history for this message
uosci-testing-bot (uosci-testing-bot) wrote :

charm_unit_test #8471 rabbitmq-server-next for thedac mp269395
    UNIT OK: passed

Build: http://10.245.162.77:8080/job/charm_unit_test/8471/

Revision history for this message
uosci-testing-bot (uosci-testing-bot) wrote :

charm_amulet_test #6178 rabbitmq-server-next for thedac mp269395
    AMULET FAIL: amulet-test failed

AMULET Results (max last 2 lines):
make: *** [functional_test] Error 1
ERROR:root:Make target returned non-zero.

Full amulet test output: http://paste.ubuntu.com/12247553/
Build: http://10.245.162.77:8080/job/charm_amulet_test/6178/

Revision history for this message
Billy Olsen (billy-olsen) :
review: Needs Resubmitting
Revision history for this message
uosci-testing-bot (uosci-testing-bot) wrote :

charm_unit_test #8625 rabbitmq-server-next for thedac mp269395
    UNIT OK: passed

Build: http://10.245.162.77:8080/job/charm_unit_test/8625/

Revision history for this message
uosci-testing-bot (uosci-testing-bot) wrote :

charm_lint_check #9327 rabbitmq-server-next for thedac mp269395
    LINT OK: passed

Build: http://10.245.162.77:8080/job/charm_lint_check/9327/

Revision history for this message
Ryan Beisner (1chb1n) wrote :

FYI - leader-settings-changed hook fail (http://paste.ubuntu.com/12270552) with these changes sync'd into next.

111. By David Ames

leader-settings-changed may run before the cluster relation exists.

Revision history for this message
uosci-testing-bot (uosci-testing-bot) wrote :

charm_lint_check #9388 rabbitmq-server-next for thedac mp269395
    LINT OK: passed

Build: http://10.245.162.77:8080/job/charm_lint_check/9388/

Revision history for this message
uosci-testing-bot (uosci-testing-bot) wrote :

charm_unit_test #8683 rabbitmq-server-next for thedac mp269395
    UNIT OK: passed

Build: http://10.245.162.77:8080/job/charm_unit_test/8683/

Revision history for this message
uosci-testing-bot (uosci-testing-bot) wrote :

charm_amulet_test #6252 rabbitmq-server-next for thedac mp269395
    AMULET OK: passed

Build: http://10.245.162.77:8080/job/charm_amulet_test/6252/

112. By David Ames

Get cookie from juju elected leader, update cookie locally and force cluster-relation-changed hooks to run on peers.
leader-election-changed runs only occur when using a version of juju with leader election so there is not the same level of uncertainty in this hook, therefore, force update_cookie to use the leader-get cookie.
peer_echo bails out if we are using LE juju. So we were in a chicken and egg scenario where peers where never clustering. The first attempted fix called cluster_changed() from leader-settings-changed but this had its own problems. Rather poke the cluster relation so that cluster-relation-changed hooks fire which leads to clustering.

Revision history for this message
uosci-testing-bot (uosci-testing-bot) wrote :

charm_unit_test #8731 rabbitmq-server-next for thedac mp269395
    UNIT OK: passed

Build: http://10.245.162.77:8080/job/charm_unit_test/8731/

Revision history for this message
uosci-testing-bot (uosci-testing-bot) wrote :

charm_lint_check #9439 rabbitmq-server-next for thedac mp269395
    LINT OK: passed

Build: http://10.245.162.77:8080/job/charm_lint_check/9439/

Revision history for this message
uosci-testing-bot (uosci-testing-bot) wrote :

charm_amulet_test #6257 rabbitmq-server-next for thedac mp269395
    AMULET FAIL: amulet-test failed

AMULET Results (max last 2 lines):
make: *** [functional_test] Error 1
ERROR:root:Make target returned non-zero.

Full amulet test output: http://paste.ubuntu.com/12279205/
Build: http://10.245.162.77:8080/job/charm_amulet_test/6257/

Revision history for this message
Ryan Beisner (1chb1n) wrote :

Just a reminder/catchup:

Ignore amulet test results on this MP, as they are the old tests which do not accurately check cluster membership nodes (high risk of false-passing tests), and have multiple opportunities for race conditions to interfere (high risk of false-failing tests).

Instead, we need to base success on whether the refactored tests [1] pass with this MP's changes sync'd into it.

Secondarily, we do also need to manually confirm the same tests succeed with non-LE juju (1.22.6), since the test automation uses LE + 1.24.5 (or whatever is in ppa:juju/stable at the moment).

[1] https://code.launchpad.net/~1chb1n/charms/trusty/rabbitmq-server/amulet-refactor-1509b/+merge/270102

Revision history for this message
Liam Young (gnuoy) wrote :

Thanks for these fixes, 1chb1n has confirmed that were getting clean clustering runs now. Approved.

review: Approve

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
1=== modified file 'hooks/rabbit_utils.py'
2--- hooks/rabbit_utils.py 2015-09-01 04:53:33 +0000
3+++ hooks/rabbit_utils.py 2015-09-05 01:49:46 +0000
4@@ -299,7 +299,12 @@
5 cmd = [RABBITMQ_CTL, 'stop_app']
6 subprocess.check_call(cmd)
7 cmd = [RABBITMQ_CTL, cluster_cmd, 'rabbit@%s' % node]
8- subprocess.check_call(cmd)
9+ try:
10+ subprocess.check_output(cmd, stderr=subprocess.STDOUT)
11+ except subprocess.CalledProcessError as e:
12+ if not e.returncode == 2 or \
13+ "{ok,already_member}" not in e.output:
14+ raise e
15 cmd = [RABBITMQ_CTL, 'start_app']
16 subprocess.check_call(cmd)
17 log('Host clustered with %s.' % node)
18
19=== modified file 'hooks/rabbitmq_server_relations.py'
20--- hooks/rabbitmq_server_relations.py 2015-09-01 04:53:33 +0000
21+++ hooks/rabbitmq_server_relations.py 2015-09-05 01:49:46 +0000
22@@ -81,6 +81,7 @@
23 peer_store,
24 peer_store_and_set,
25 peer_retrieve_by_prefix,
26+ leader_get,
27 )
28
29 from charmhelpers.contrib.network.ip import get_address_in_network
30@@ -304,31 +305,35 @@
31 return
32
33 if is_elected_leader('res_rabbitmq_vip'):
34+ log('Leader peer_storing cookie', level=INFO)
35 cookie = open(rabbit.COOKIE_PATH, 'r').read().strip()
36 peer_store('cookie', cookie)
37
38
39 @hooks.hook('cluster-relation-changed')
40 def cluster_changed():
41+ # Future travelers beware ordering is significant
42+ rdata = relation_get()
43+ # sync passwords
44+ blacklist = ['hostname', 'private-address', 'public-address']
45+ whitelist = [a for a in rdata.keys() if a not in blacklist]
46+ peer_echo(includes=whitelist)
47+
48 cookie = peer_retrieve('cookie')
49 if not cookie:
50 log('cluster_joined: cookie not yet set.', level=INFO)
51 return
52
53- rdata = relation_get()
54 if config('prefer-ipv6') and rdata.get('hostname'):
55- private_address = rdata['private-address']
56- hostname = rdata['hostname']
57+ private_address = rdata.get('private-address')
58+ hostname = rdata.get('hostname')
59 if hostname:
60 rabbit.update_hosts_file({private_address: hostname})
61
62- # sync passwords
63- blacklist = ['hostname', 'private-address', 'public-address']
64- whitelist = [a for a in rdata.keys() if a not in blacklist]
65- peer_echo(includes=whitelist)
66-
67 if not is_sufficient_peers():
68 # Stop rabbit until leader has finished configuring
69+ log('Not enough peers, stopping until leader is configured',
70+ level=INFO)
71 service_stop('rabbitmq-server')
72 return
73
74@@ -358,9 +363,12 @@
75 amqp_changed(relation_id=rid, remote_unit=unit)
76
77
78-def update_cookie():
79+def update_cookie(leaders_cookie=None):
80 # sync cookie
81- cookie = peer_retrieve('cookie')
82+ if leaders_cookie:
83+ cookie = leaders_cookie
84+ else:
85+ cookie = peer_retrieve('cookie')
86 cookie_local = None
87 with open(rabbit.COOKIE_PATH, 'r') as f:
88 cookie_local = f.read().strip()
89@@ -763,6 +771,18 @@
90
91 @hooks.hook('leader-settings-changed')
92 def leader_settings_changed():
93+ # Get cookie from leader, update cookie locally and
94+ # force cluster-relation-changed hooks to run on peers
95+ cookie = leader_get(attribute='cookie')
96+ if cookie:
97+ update_cookie(leaders_cookie=cookie)
98+ # Force cluster-relation-changed hooks to run on peers
99+ # This will precipitate peer clustering
100+ # Without this a chicken and egg scenario prevails when
101+ # using LE and peerstorage
102+ for rid in relation_ids('cluster'):
103+ relation_set(relation_id=rid, relation_settings={'cookie': cookie})
104+
105 # If leader has changed and access credentials, ripple these
106 # out from all units
107 for rid in relation_ids('amqp'):

Subscribers

People subscribed via source and target branches