Merge lp:~thedac/charms/trusty/rabbitmq-server/native-cluster-race-fixes into lp:~openstack-charmers-archive/charms/trusty/rabbitmq-server/next
- Trusty Tahr (14.04)
- native-cluster-race-fixes
- Merge into next
Status: | Merged |
---|---|
Merged at revision: | 111 |
Proposed branch: | lp:~thedac/charms/trusty/rabbitmq-server/native-cluster-race-fixes |
Merge into: | lp:~openstack-charmers-archive/charms/trusty/rabbitmq-server/next |
Diff against target: |
107 lines (+36/-11) 2 files modified
hooks/rabbit_utils.py (+6/-1) hooks/rabbitmq_server_relations.py (+30/-10) |
To merge this branch: | bzr merge lp:~thedac/charms/trusty/rabbitmq-server/native-cluster-race-fixes |
Related bugs: |
Reviewer | Review Type | Date Requested | Status |
---|---|---|---|
Liam Young (community) | Approve | ||
Billy Olsen | Needs Resubmitting | ||
Ryan Beisner (community) | Needs Resubmitting | ||
Review via email: mp+269395@code.launchpad.net |
Commit message
Description of the change
Fix clustering race conditions for LP Bug#1486177
For juju >= 1.24 run cluster_changed() when the leader-
For juju < 1.24 use remote_unit() in relation_get when in a native cluster relation hook
See also https:/
uosci-testing-bot (uosci-testing-bot) wrote : | # |
uosci-testing-bot (uosci-testing-bot) wrote : | # |
charm_lint_check #8827 rabbitmq-
LINT OK: passed
Ryan Beisner (1chb1n) wrote : | # |
Take note that the amulet test results on this MP are irrelevant to this proposal, as they are the old tests which do not accurately check that each node is a cluster member; They only check that the `rabbitmqctl cluster_status` command exits 0, which will be True even for a 2 or 1-node cluster.
I've re-sync'd these proposed changes into my amulet test refactor branch, and will confirm consistent clustering behavior and report back here.
Thank you for your work on this.
Ryan Beisner (1chb1n) wrote : | # |
Just noting that this MP is dependent on https:/
uosci-testing-bot (uosci-testing-bot) wrote : | # |
charm_amulet_test #6061 rabbitmq-
AMULET OK: passed
Build: http://
Billy Olsen (billy-olsen) wrote : | # |
Couple of minor comments below. Would be nice to see some unit tests around this too please.
uosci-testing-bot (uosci-testing-bot) wrote : | # |
charm_unit_test #8407 rabbitmq-
UNIT OK: passed
uosci-testing-bot (uosci-testing-bot) wrote : | # |
charm_lint_check #9098 rabbitmq-
LINT OK: passed
Ryan Beisner (1chb1n) wrote : | # |
Bah. This now conflicts after next rev 107 (https:/
Can you rebase and resolve conflicts in this MP? I can then pull it back into the test branch, which I'm actively working on atm. Much appreciated.
Billy Olsen (billy-olsen) wrote : | # |
Thanks for making the changes requested David! I've gone ahead and resolved the merge conflict in my merge (since I contributed to causing the merge conflict). Ryan has given enough testing through his own OSCI runs and such for me to give this one a thumbs up and push it through now.
Ryan Beisner (1chb1n) wrote : | # |
Actually, I didn't get to test r107 + this, which definitely is not working: http://
I recommend reverting rmq/next r108 and r107 (back to 106) so we can get this CRIT finished, then land the new tests, and only then re-consider other proposals to rmq when we can reliably test them.
uosci-testing-bot (uosci-testing-bot) wrote : | # |
charm_lint_check #9165 rabbitmq-
LINT OK: passed
uosci-testing-bot (uosci-testing-bot) wrote : | # |
charm_unit_test #8471 rabbitmq-
UNIT OK: passed
uosci-testing-bot (uosci-testing-bot) wrote : | # |
charm_amulet_test #6178 rabbitmq-
AMULET FAIL: amulet-test failed
AMULET Results (max last 2 lines):
make: *** [functional_test] Error 1
ERROR:root:Make target returned non-zero.
Full amulet test output: http://
Build: http://
Billy Olsen (billy-olsen) : | # |
uosci-testing-bot (uosci-testing-bot) wrote : | # |
charm_unit_test #8625 rabbitmq-
UNIT OK: passed
uosci-testing-bot (uosci-testing-bot) wrote : | # |
charm_lint_check #9327 rabbitmq-
LINT OK: passed
Ryan Beisner (1chb1n) wrote : | # |
FYI - leader-
- 111. By David Ames
-
leader-
settings- changed may run before the cluster relation exists.
uosci-testing-bot (uosci-testing-bot) wrote : | # |
charm_lint_check #9388 rabbitmq-
LINT OK: passed
uosci-testing-bot (uosci-testing-bot) wrote : | # |
charm_unit_test #8683 rabbitmq-
UNIT OK: passed
uosci-testing-bot (uosci-testing-bot) wrote : | # |
charm_amulet_test #6252 rabbitmq-
AMULET OK: passed
Build: http://
- 112. By David Ames
-
Get cookie from juju elected leader, update cookie locally and force cluster-
relation- changed hooks to run on peers.
leader-election- changed runs only occur when using a version of juju with leader election so there is not the same level of uncertainty in this hook, therefore, force update_cookie to use the leader-get cookie.
peer_echo bails out if we are using LE juju. So we were in a chicken and egg scenario where peers where never clustering. The first attempted fix called cluster_changed() from leader-settings- changed but this had its own problems. Rather poke the cluster relation so that cluster- relation- changed hooks fire which leads to clustering.
uosci-testing-bot (uosci-testing-bot) wrote : | # |
charm_unit_test #8731 rabbitmq-
UNIT OK: passed
uosci-testing-bot (uosci-testing-bot) wrote : | # |
charm_lint_check #9439 rabbitmq-
LINT OK: passed
uosci-testing-bot (uosci-testing-bot) wrote : | # |
charm_amulet_test #6257 rabbitmq-
AMULET FAIL: amulet-test failed
AMULET Results (max last 2 lines):
make: *** [functional_test] Error 1
ERROR:root:Make target returned non-zero.
Full amulet test output: http://
Build: http://
Ryan Beisner (1chb1n) wrote : | # |
Just a reminder/catchup:
Ignore amulet test results on this MP, as they are the old tests which do not accurately check cluster membership nodes (high risk of false-passing tests), and have multiple opportunities for race conditions to interfere (high risk of false-failing tests).
Instead, we need to base success on whether the refactored tests [1] pass with this MP's changes sync'd into it.
Secondarily, we do also need to manually confirm the same tests succeed with non-LE juju (1.22.6), since the test automation uses LE + 1.24.5 (or whatever is in ppa:juju/stable at the moment).
[1] https:/
Liam Young (gnuoy) wrote : | # |
Thanks for these fixes, 1chb1n has confirmed that were getting clean clustering runs now. Approved.
Preview Diff
1 | === modified file 'hooks/rabbit_utils.py' |
2 | --- hooks/rabbit_utils.py 2015-09-01 04:53:33 +0000 |
3 | +++ hooks/rabbit_utils.py 2015-09-05 01:49:46 +0000 |
4 | @@ -299,7 +299,12 @@ |
5 | cmd = [RABBITMQ_CTL, 'stop_app'] |
6 | subprocess.check_call(cmd) |
7 | cmd = [RABBITMQ_CTL, cluster_cmd, 'rabbit@%s' % node] |
8 | - subprocess.check_call(cmd) |
9 | + try: |
10 | + subprocess.check_output(cmd, stderr=subprocess.STDOUT) |
11 | + except subprocess.CalledProcessError as e: |
12 | + if not e.returncode == 2 or \ |
13 | + "{ok,already_member}" not in e.output: |
14 | + raise e |
15 | cmd = [RABBITMQ_CTL, 'start_app'] |
16 | subprocess.check_call(cmd) |
17 | log('Host clustered with %s.' % node) |
18 | |
19 | === modified file 'hooks/rabbitmq_server_relations.py' |
20 | --- hooks/rabbitmq_server_relations.py 2015-09-01 04:53:33 +0000 |
21 | +++ hooks/rabbitmq_server_relations.py 2015-09-05 01:49:46 +0000 |
22 | @@ -81,6 +81,7 @@ |
23 | peer_store, |
24 | peer_store_and_set, |
25 | peer_retrieve_by_prefix, |
26 | + leader_get, |
27 | ) |
28 | |
29 | from charmhelpers.contrib.network.ip import get_address_in_network |
30 | @@ -304,31 +305,35 @@ |
31 | return |
32 | |
33 | if is_elected_leader('res_rabbitmq_vip'): |
34 | + log('Leader peer_storing cookie', level=INFO) |
35 | cookie = open(rabbit.COOKIE_PATH, 'r').read().strip() |
36 | peer_store('cookie', cookie) |
37 | |
38 | |
39 | @hooks.hook('cluster-relation-changed') |
40 | def cluster_changed(): |
41 | + # Future travelers beware ordering is significant |
42 | + rdata = relation_get() |
43 | + # sync passwords |
44 | + blacklist = ['hostname', 'private-address', 'public-address'] |
45 | + whitelist = [a for a in rdata.keys() if a not in blacklist] |
46 | + peer_echo(includes=whitelist) |
47 | + |
48 | cookie = peer_retrieve('cookie') |
49 | if not cookie: |
50 | log('cluster_joined: cookie not yet set.', level=INFO) |
51 | return |
52 | |
53 | - rdata = relation_get() |
54 | if config('prefer-ipv6') and rdata.get('hostname'): |
55 | - private_address = rdata['private-address'] |
56 | - hostname = rdata['hostname'] |
57 | + private_address = rdata.get('private-address') |
58 | + hostname = rdata.get('hostname') |
59 | if hostname: |
60 | rabbit.update_hosts_file({private_address: hostname}) |
61 | |
62 | - # sync passwords |
63 | - blacklist = ['hostname', 'private-address', 'public-address'] |
64 | - whitelist = [a for a in rdata.keys() if a not in blacklist] |
65 | - peer_echo(includes=whitelist) |
66 | - |
67 | if not is_sufficient_peers(): |
68 | # Stop rabbit until leader has finished configuring |
69 | + log('Not enough peers, stopping until leader is configured', |
70 | + level=INFO) |
71 | service_stop('rabbitmq-server') |
72 | return |
73 | |
74 | @@ -358,9 +363,12 @@ |
75 | amqp_changed(relation_id=rid, remote_unit=unit) |
76 | |
77 | |
78 | -def update_cookie(): |
79 | +def update_cookie(leaders_cookie=None): |
80 | # sync cookie |
81 | - cookie = peer_retrieve('cookie') |
82 | + if leaders_cookie: |
83 | + cookie = leaders_cookie |
84 | + else: |
85 | + cookie = peer_retrieve('cookie') |
86 | cookie_local = None |
87 | with open(rabbit.COOKIE_PATH, 'r') as f: |
88 | cookie_local = f.read().strip() |
89 | @@ -763,6 +771,18 @@ |
90 | |
91 | @hooks.hook('leader-settings-changed') |
92 | def leader_settings_changed(): |
93 | + # Get cookie from leader, update cookie locally and |
94 | + # force cluster-relation-changed hooks to run on peers |
95 | + cookie = leader_get(attribute='cookie') |
96 | + if cookie: |
97 | + update_cookie(leaders_cookie=cookie) |
98 | + # Force cluster-relation-changed hooks to run on peers |
99 | + # This will precipitate peer clustering |
100 | + # Without this a chicken and egg scenario prevails when |
101 | + # using LE and peerstorage |
102 | + for rid in relation_ids('cluster'): |
103 | + relation_set(relation_id=rid, relation_settings={'cookie': cookie}) |
104 | + |
105 | # If leader has changed and access credentials, ripple these |
106 | # out from all units |
107 | for rid in relation_ids('amqp'): |
charm_unit_test #8155 rabbitmq- server- next for thedac mp269395
UNIT OK: passed
Build: http:// 10.245. 162.77: 8080/job/ charm_unit_ test/8155/