Juju Charms Collection
rabbitmq-server package

Merge lp:~thedac/charms/trusty/rabbitmq-server/native-cluster-race-fixes into lp:~openstack-charmers-archive/charms/trusty/rabbitmq-server/next

Proposed by David Ames on 2015-08-27

Status:	Merged
Merged at revision:	111
Proposed branch:	lp:~thedac/charms/trusty/rabbitmq-server/native-cluster-race-fixes
Merge into:	lp:~openstack-charmers-archive/charms/trusty/rabbitmq-server/next
Diff against target:	107 lines (+36/-11) 2 files modified hooks/rabbit_utils.py (+6/-1) hooks/rabbitmq_server_relations.py (+30/-10)
To merge this branch:	bzr merge lp:~thedac/charms/trusty/rabbitmq-server/native-cluster-race-fixes
Related bugs:	Link a bug report

Reviewer	Date Requested	Status
Liam Young (community)	2015-08-31	Approve on 2015-09-09
Billy Olsen	2015-08-27	Needs Resubmitting on 2015-09-02
Ryan Beisner (community)		Needs Resubmitting on 2015-09-01
Review via email: mp+269395@code.launchpad.net

Description of the change

Fix clustering race conditions for LP Bug#1486177

For juju >= 1.24 run cluster_changed() when the leader-settings-changed hook executes.
For juju < 1.24 use remote_unit() in relation_get when in a native cluster relation hook

Revision history for this message

uosci-testing-bot (uosci-testing-bot) wrote on 2015-08-27:

charm_unit_test #8155 rabbitmq-server-next for thedac mp269395
UNIT OK: passed

Build: http://10.245.162.77:8080/job/charm_unit_test/8155/

Revision history for this message

uosci-testing-bot (uosci-testing-bot) wrote on 2015-08-27:

charm_lint_check #8827 rabbitmq-server-next for thedac mp269395
LINT OK: passed

Build: http://10.245.162.77:8080/job/charm_lint_check/8827/

Revision history for this message

Ryan Beisner (1chb1n) wrote on 2015-08-27:

Take note that the amulet test results on this MP are irrelevant to this proposal, as they are the old tests which do not accurately check that each node is a cluster member; They only check that the `rabbitmqctl cluster_status` command exits 0, which will be True even for a 2 or 1-node cluster.

I've re-sync'd these proposed changes into my amulet test refactor branch, and will confirm consistent clustering behavior and report back here.

Thank you for your work on this.

Revision history for this message

Ryan Beisner (1chb1n) wrote on 2015-08-27:

Just noting that this MP is dependent on https://code.launchpad.net/~thedac/charm-helpers/legacy_leadership_peer_retrieve_fix/+merge/269400 landing, pending validation of functional tests of course.

Revision history for this message

uosci-testing-bot (uosci-testing-bot) wrote on 2015-08-27:

charm_amulet_test #6061 rabbitmq-server-next for thedac mp269395
AMULET OK: passed

Build: http://10.245.162.77:8080/job/charm_amulet_test/6061/

Revision history for this message

Billy Olsen (billy-olsen) wrote on 2015-08-31:

Couple of minor comments below. Would be nice to see some unit tests around this too please.

review: Needs Fixing

Revision history for this message

uosci-testing-bot (uosci-testing-bot) wrote on 2015-08-31:

charm_unit_test #8407 rabbitmq-server-next for thedac mp269395
UNIT OK: passed

Build: http://10.245.162.77:8080/job/charm_unit_test/8407/

Revision history for this message

uosci-testing-bot (uosci-testing-bot) wrote on 2015-08-31:

charm_lint_check #9098 rabbitmq-server-next for thedac mp269395
LINT OK: passed

Build: http://10.245.162.77:8080/job/charm_lint_check/9098/

Revision history for this message

Ryan Beisner (1chb1n) wrote on 2015-08-31:

Bah. This now conflicts after next rev 107 (https://code.launchpad.net/~niedbalski/charms/trusty/rabbitmq-server/fix-lp-1489053/+merge/269261).

Can you rebase and resolve conflicts in this MP? I can then pull it back into the test branch, which I'm actively working on atm. Much appreciated.

Revision history for this message

Billy Olsen (billy-olsen) wrote on 2015-09-01:

Thanks for making the changes requested David! I've gone ahead and resolved the merge conflict in my merge (since I contributed to causing the merge conflict). Ryan has given enough testing through his own OSCI runs and such for me to give this one a thumbs up and push it through now.

review: Approve

Revision history for this message

Ryan Beisner (1chb1n) wrote on 2015-09-01:

Actually, I didn't get to test r107 + this, which definitely is not working: http://paste.ubuntu.com/12242485/

I recommend reverting rmq/next r108 and r107 (back to 106) so we can get this CRIT finished, then land the new tests, and only then re-consider other proposals to rmq when we can reliably test them.

review: Needs Resubmitting

Revision history for this message

uosci-testing-bot (uosci-testing-bot) wrote on 2015-09-01:

charm_lint_check #9165 rabbitmq-server-next for thedac mp269395
LINT OK: passed

Build: http://10.245.162.77:8080/job/charm_lint_check/9165/

Revision history for this message

uosci-testing-bot (uosci-testing-bot) wrote on 2015-09-01:

charm_unit_test #8471 rabbitmq-server-next for thedac mp269395
UNIT OK: passed

Build: http://10.245.162.77:8080/job/charm_unit_test/8471/

Revision history for this message

uosci-testing-bot (uosci-testing-bot) wrote on 2015-09-01:

charm_amulet_test #6178 rabbitmq-server-next for thedac mp269395
AMULET FAIL: amulet-test failed

AMULET Results (max last 2 lines):
make: *** [functional_test] Error 1
ERROR:root:Make target returned non-zero.

Full amulet test output: http://paste.ubuntu.com/12247553/
Build: http://10.245.162.77:8080/job/charm_amulet_test/6178/

Revision history for this message

Billy Olsen (billy-olsen) on 2015-09-02:

review: Needs Resubmitting

Revision history for this message

uosci-testing-bot (uosci-testing-bot) wrote on 2015-09-03:

charm_unit_test #8625 rabbitmq-server-next for thedac mp269395
UNIT OK: passed

Build: http://10.245.162.77:8080/job/charm_unit_test/8625/

Revision history for this message

uosci-testing-bot (uosci-testing-bot) wrote on 2015-09-03:

charm_lint_check #9327 rabbitmq-server-next for thedac mp269395
LINT OK: passed

Build: http://10.245.162.77:8080/job/charm_lint_check/9327/

Revision history for this message

Ryan Beisner (1chb1n) wrote on 2015-09-04:

FYI - leader-settings-changed hook fail (http://paste.ubuntu.com/12270552) with these changes sync'd into next.

lp:~thedac/charms/trusty/rabbitmq-server/native-cluster-race-fixes updated on 2015-09-04

111. By David Ames on 2015-09-04: leader-settings-changed may run before the cluster relation exists.

Revision history for this message

uosci-testing-bot (uosci-testing-bot) wrote on 2015-09-04:

charm_lint_check #9388 rabbitmq-server-next for thedac mp269395
LINT OK: passed

Build: http://10.245.162.77:8080/job/charm_lint_check/9388/

Revision history for this message

uosci-testing-bot (uosci-testing-bot) wrote on 2015-09-04:

charm_unit_test #8683 rabbitmq-server-next for thedac mp269395
UNIT OK: passed

Build: http://10.245.162.77:8080/job/charm_unit_test/8683/

Revision history for this message

uosci-testing-bot (uosci-testing-bot) wrote on 2015-09-04:

charm_amulet_test #6252 rabbitmq-server-next for thedac mp269395
AMULET OK: passed

Build: http://10.245.162.77:8080/job/charm_amulet_test/6252/

lp:~thedac/charms/trusty/rabbitmq-server/native-cluster-race-fixes updated on 2015-09-05

112. By David Ames on 2015-09-05: Get cookie from juju elected leader, update cookie locally and force cluster-relation-changed hooks to run on peers.
leader-election-changed runs only occur when using a version of juju with leader election so there is not the same level of uncertainty in this hook, therefore, force update_cookie to use the leader-get cookie.
peer_echo bails out if we are using LE juju. So we were in a chicken and egg scenario where peers where never clustering. The first attempted fix called cluster_changed() from leader-settings-changed but this had its own problems. Rather poke the cluster relation so that cluster-relation-changed hooks fire which leads to clustering.

Revision history for this message

uosci-testing-bot (uosci-testing-bot) wrote on 2015-09-05:

charm_unit_test #8731 rabbitmq-server-next for thedac mp269395
UNIT OK: passed

Build: http://10.245.162.77:8080/job/charm_unit_test/8731/

Revision history for this message

uosci-testing-bot (uosci-testing-bot) wrote on 2015-09-05:

charm_lint_check #9439 rabbitmq-server-next for thedac mp269395
LINT OK: passed

Build: http://10.245.162.77:8080/job/charm_lint_check/9439/

Revision history for this message

uosci-testing-bot (uosci-testing-bot) wrote on 2015-09-05:

charm_amulet_test #6257 rabbitmq-server-next for thedac mp269395
AMULET FAIL: amulet-test failed

AMULET Results (max last 2 lines):
make: *** [functional_test] Error 1
ERROR:root:Make target returned non-zero.

Full amulet test output: http://paste.ubuntu.com/12279205/
Build: http://10.245.162.77:8080/job/charm_amulet_test/6257/

Revision history for this message

Ryan Beisner (1chb1n) wrote on 2015-09-05:

Just a reminder/catchup:

Ignore amulet test results on this MP, as they are the old tests which do not accurately check cluster membership nodes (high risk of false-passing tests), and have multiple opportunities for race conditions to interfere (high risk of false-failing tests).

Instead, we need to base success on whether the refactored tests [1] pass with this MP's changes sync'd into it.

Secondarily, we do also need to manually confirm the same tests succeed with non-LE juju (1.22.6), since the test automation uses LE + 1.24.5 (or whatever is in ppa:juju/stable at the moment).

[1] https://code.launchpad.net/~1chb1n/charms/trusty/rabbitmq-server/amulet-refactor-1509b/+merge/270102

Revision history for this message

Liam Young (gnuoy) wrote on 2015-09-09:

Thanks for these fixes, 1chb1n has confirmed that were getting clean clustering runs now. Approved.

review: Approve

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk

Subscribers

People subscribed via source and target branches

to all changes:

David Ames

Nobuto Murata

OpenStack Charmers

 === modified file 'hooks/rabbit_utils.py'
 --- hooks/rabbit_utils.py	2015-09-01 04:53:33 +0000
 +++ hooks/rabbit_utils.py	2015-09-05 01:49:46 +0000
@@ -299,7 +299,12 @@
              cmd = [RABBITMQ_CTL, 'stop_app']
              subprocess.check_call(cmd)
              cmd = [RABBITMQ_CTL, cluster_cmd, 'rabbit@%s' % node]
--            subprocess.check_call(cmd)
++            try:
++                subprocess.check_output(cmd, stderr=subprocess.STDOUT)
++            except subprocess.CalledProcessError as e:
++                if not e.returncode == 2 or \
++                        "{ok,already_member}" not in e.output:
++                    raise e
              cmd = [RABBITMQ_CTL, 'start_app']
              subprocess.check_call(cmd)
              log('Host clustered with %s.' % node)
 === modified file 'hooks/rabbitmq_server_relations.py'
 --- hooks/rabbitmq_server_relations.py	2015-09-01 04:53:33 +0000
 +++ hooks/rabbitmq_server_relations.py	2015-09-05 01:49:46 +0000
@@ -81,6 +81,7 @@
      peer_store,
      peer_store_and_set,
      peer_retrieve_by_prefix,
++    leader_get,
+ )
  from charmhelpers.contrib.network.ip import get_address_in_network
@@ -304,31 +305,35 @@
          return
      if is_elected_leader('res_rabbitmq_vip'):
++        log('Leader peer_storing cookie', level=INFO)
          cookie = open(rabbit.COOKIE_PATH, 'r').read().strip()
          peer_store('cookie', cookie)
  @hooks.hook('cluster-relation-changed')
  def cluster_changed():
++    # Future travelers beware ordering is significant
++    rdata = relation_get()
++    # sync passwords
++    blacklist = ['hostname', 'private-address', 'public-address']
++    whitelist = [a for a in rdata.keys() if a not in blacklist]
++    peer_echo(includes=whitelist)
++
      cookie = peer_retrieve('cookie')
      if not cookie:
          log('cluster_joined: cookie not yet set.', level=INFO)
          return
--    rdata = relation_get()
      if config('prefer-ipv6') and rdata.get('hostname'):
--        private_address = rdata['private-address']
--        hostname = rdata['hostname']
++        private_address = rdata.get('private-address')
++        hostname = rdata.get('hostname')
          if hostname:
              rabbit.update_hosts_file({private_address: hostname})
--    # sync passwords
--    blacklist = ['hostname', 'private-address', 'public-address']
--    whitelist = [a for a in rdata.keys() if a not in blacklist]
--    peer_echo(includes=whitelist)
--
      if not is_sufficient_peers():
          # Stop rabbit until leader has finished configuring
++        log('Not enough peers, stopping until leader is configured',
++            level=INFO)
          service_stop('rabbitmq-server')
          return
@@ -358,9 +363,12 @@
              amqp_changed(relation_id=rid, remote_unit=unit)
--def update_cookie():
++def update_cookie(leaders_cookie=None):
      # sync cookie
--    cookie = peer_retrieve('cookie')
++    if leaders_cookie:
++        cookie = leaders_cookie
++    else:
++        cookie = peer_retrieve('cookie')
      cookie_local = None
      with open(rabbit.COOKIE_PATH, 'r') as f:
          cookie_local = f.read().strip()
@@ -763,6 +771,18 @@
  @hooks.hook('leader-settings-changed')
  def leader_settings_changed():
++    # Get cookie from leader, update cookie locally and
++    # force cluster-relation-changed hooks to run on peers
++    cookie = leader_get(attribute='cookie')
++    if cookie:
++        update_cookie(leaders_cookie=cookie)
++        # Force cluster-relation-changed hooks to run on peers
++        # This will precipitate peer clustering
++        # Without this a chicken and egg scenario prevails when
++        # using LE and peerstorage
++        for rid in relation_ids('cluster'):
++            relation_set(relation_id=rid, relation_settings={'cookie': cookie})
++
      # If leader has changed and access credentials, ripple these
      # out from all units
      for rid in relation_ids('amqp'):

Juju Charms Collectionrabbitmq-server package

Merge lp:~thedac/charms/trusty/rabbitmq-server/native-cluster-race-fixes into lp:~openstack-charmers-archive/charms/trusty/rabbitmq-server/next

Commit message

Description of the change

Preview Diff

Subscribers

Juju Charms Collection
rabbitmq-server package