can't read availability zones and instances don't start

Bug #1823740 reported by Jason Hobbs
This bug affects 1 person
Affects: OpenStack Nova Cloud Controller Charm
Status: Won't Fix
Importance: Critical
Assigned to: David Ames
Milestone: 19.04

Bug Description

When testing the OpenStack next charms, we found that instances wouldn't start and that we were unable to retrieve the list of availability zones.

ubuntu@production-cpe-e9a6b960-8f67-48b6-9695-ae90b23c0b09:~/project/config$ openstack availability zone list
Unable to establish connection to http://10.244.40.91:8774/v2.1/os-availability-zone/detail: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response',))

Instances stay in BUILD status forever:
http://paste.ubuntu.com/p/6qtnthbRWV/

bundle:
http://paste.ubuntu.com/p/MVcWp6xvVK/

Links to crashdumps from several test runs are available here:
https://solutions.qa.canonical.com/#/qa/bug/1823740

Click on an instance ID to see a test run, then go to the artifacts listing at the bottom of the test run page.

We hit this every time.

affects: cdoqa-system-tests → charm-nova-cloud-controller
tags: added: cdo-qa foundations-engine
summary: - rally times out against openstack charms next
+ can't read availability zones and instances don't start
description: updated
description: updated
tags: added: cdo-release-blocker
description: updated
description: updated
description: updated
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Is it possible that this is bug 1822541?

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

http://people.canonical.com/~jhobbs/juju-crashdump-openstack-2019-04-08-21.55.12.tar.gz

nova-cloud-controller_0/var/log/nova/nova-conductor.log from this crashdump shows the same error as bug 1822541:

http://paste.ubuntu.com/p/hDVvX3PhPN/

and we have memcached related to n-c-c, so I think it is the same issue.

Revision history for this message
David Ames (thedac) wrote :

Confirmed, this appears to be a duplicate of bug 1822541.

Revision history for this message
Ryan Beisner (1chb1n) wrote :

Marking this bug as a duplicate of the actual upstream bug.

The issue, and the fix, are not tied to charm revisions or releases in any way.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

It's kind of tied to charms in that the new charms force us to relate to memcache, which causes this to happen.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

This must not be a dup, or at least that's not the full story, because we're still seeing the symptoms of this bug with the fixed oslo.cache package, even though we no longer see the traceback.

Here is an updated crashdump:
http://people.canonical.com/~jhobbs/juju-crashdump-openstack-2019-04-11-00.51.42.tar.gz

Revision history for this message
Ryan Beisner (1chb1n) wrote :

We are having trouble locating fresh "next" runs in solutions.qa.c.c related to ^ this bug. Can you please update the bug with distinct links to the run with the proposed packages?

Changed in charm-nova-cloud-controller:
importance: Undecided → Critical
assignee: nobody → Sahid Orentino (sahid-ferdjaoui)
milestone: none → 19.04
status: New → Incomplete
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :
Changed in charm-nova-cloud-controller:
status: Incomplete → New
Revision history for this message
James Page (james-page) wrote :

I'm still struggling to reproduce this; I verified that I saw the original issue (instances stuck in BUILD) with rocky, and then upgraded to rocky-proposed - instances are building fine and the call above is working:

$ openstack availability zone list
+-----------+-------------+
| Zone Name | Zone Status |
+-----------+-------------+
| internal | available |
| nova | available |
| nova | available |
| nova | available |
| nova | available |
+-----------+-------------+

HA nova-cloud-controller deployment with 3 units and a relation to memcached

Revision history for this message
James Page (james-page) wrote :

@jhobbs

I'm struggling to actually retrieve the test artefacts for that most recent run -

https://oil-jenkins.canonical.com/artifacts/133de1bc-7cdf-40c4-9a8e-7948986ec8c3/index.html

is returning a not found.

Changed in charm-nova-cloud-controller:
status: New → Incomplete
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

@james-page there was a hiccup with swift - they are there now, thanks.

Changed in charm-nova-cloud-controller:
status: Incomplete → New
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

This is the latest reproduction now:

https://solutions.qa.canonical.com/#/qa/testRun/4b4e2f84-b3b3-4b5d-a070-005125f12abd

rally failed, timing out after 5 minutes waiting for an instance to change from BUILD to ACTIVE.

I can get an instance to start if I go through horizon.

However, I can't list availability zones through either horizon or the CLI:

ubuntu@production-cpe-4b4e2f84-b3b3-4b5d-a070-005125f12abd:~/project$ openstack availability zone list
^^^ Disconnects

Revision history for this message
David Ames (thedac) wrote :

TRIAGE:

Confirmed rocky-proposed in use.
Confirmed python3-oslo.cache 1.30.1-0ubuntu1.1~cloud0

The bug is in the relation between nova-cloud-controller and memcached. It is using the wrong space IP address.

Steps to recreate:

When running:
openstack availability zone list
Unable to establish connection to http://10.244.40.91:8774/v2.1/os-availability-zone/detail: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response',))

On nova-cloud-controller/0
root@juju-ab7f6f-18-lxd-5:/var/log/nova# netstat -tn|grep SYN
tcp 0 1 192.168.33.172:39964 192.168.33.163:11211 SYN_SENT

Manual attempt also times out:
root@juju-ab7f6f-18-lxd-5:/var/log/nova# nc -vz 192.168.33.163 11211

ifconfig for nova-cloud-controller/0
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
        inet 10.244.41.50 netmask 255.255.248.0 broadcast 10.244.47.255
        inet6 fe80::216:3eff:fe8c:81e2 prefixlen 64 scopeid 0x20<link>
        ether 00:16:3e:8c:81:e2 txqueuelen 1000 (Ethernet)
        RX packets 254198 bytes 295532290 (295.5 MB)
        RX errors 0 dropped 1 overruns 0 frame 0
        TX packets 284144 bytes 109114470 (109.1 MB)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

eth1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 9000
        inet 192.168.33.172 netmask 255.255.255.0 broadcast 192.168.33.255
        inet6 fe80::216:3eff:fe8a:e400 prefixlen 64 scopeid 0x20<link>
        ether 00:16:3e:8a:e4:00 txqueuelen 1000 (Ethernet)
        RX packets 369976 bytes 139660554 (139.6 MB)
        RX errors 0 dropped 1 overruns 0 frame 0
        TX packets 438489 bytes 84210710 (84.2 MB)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
        inet 127.0.0.1 netmask 255.0.0.0
        inet6 ::1 prefixlen 128 scopeid 0x10<host>
        loop txqueuelen 1000 (Local Loopback)
        RX packets 151594 bytes 44616315 (44.6 MB)
        RX errors 0 dropped 0 overruns 0 frame 0
        TX packets 151594 bytes 44616315 (44.6 MB)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

memcached/1 iptables rules
Chain ufw-user-input (1 references)
target prot opt source destination
ACCEPT tcp -- 10.244.41.55 0.0.0.0/0 tcp dpt:11211
ACCEPT tcp -- 10.244.41.52 0.0.0.0/0 tcp dpt:11211
ACCEPT tcp -- 10.244.41.50 0.0.0.0/0 tcp dpt:11211
ACCEPT tcp -- 192.168.33.180 0.0.0.0/0 tcp dpt:11211
ACCEPT tcp -- 192.168.33.173 0.0.0.0/0 tcp dpt:11211
ACCEPT tcp -- 192.168.33.164 0.0.0.0/0 tcp dpt:11211
ACCEPT tcp -- 192.168.33.167 0.0.0.0/0 tcp dpt:11211
ACCEPT tcp -- 192.168.33.156 0.0.0.0/0 tcp dpt:11211
ACCEPT tcp -- 192.168.33.147 0.0.0.0/0 tcp dpt:11211
ACCEPT tcp -- 192.168.33.151 0.0.0.0/0 tcp dpt:11211
ACCEPT tcp -- 0.0.0.0/0 0.0.0.0/0 tcp dpt:22
DROP tcp -- 0.0.0.0/0 0.0.0.0/0 tcp...


Revision history for this message
David Ames (thedac) wrote :

It appears both charms are doing the right thing by calling get_relation_ip() with the relation name:

https://github.com/openstack/charm-nova-cloud-controller/blob/bd3d84cfcd2b7696fd5299ac2c906e491d2a73e6/hooks/nova_cc_hooks.py#L970
https://git.launchpad.net/memcached-charm/tree/hooks/memcached_hooks.py#n161
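
For illustration, a rough way to check which address each endpoint resolves (a sketch, assuming the endpoint names memcache and cache used by these charms; network-get is the hook tool that space-aware helpers like get_relation_ip typically consult):

# address nova-cloud-controller's 'memcache' endpoint resolves to
juju run --unit nova-cloud-controller/0 'network-get memcache'
# address memcached's 'cache' endpoint resolves to
juju run --unit memcached/1 'network-get cache'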

This suggests the bundle will require:

nova-cloud-controller:
  bindings:
    memcache: $CORRECT_SPACE

memcached:
  bindings:
    cache: $CORRECT_SPACE

Please test this.
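
One way to double-check what address nova-cloud-controller actually published to memcached over the relation (a sketch; the relation id cache:NN and the 'private-address' key are the usual ones but may differ in a given deployment):

# list the relation id for memcached's 'cache' endpoint
juju run --unit memcached/1 'relation-ids cache'
# substitute the relation id returned above for cache:NN
juju run --unit memcached/1 'relation-get -r cache:NN private-address nova-cloud-controller/0'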

Changed in charm-nova-cloud-controller:
status: New → Incomplete
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Testing confirms the binding was the issue, and that correcting it fixes all the problems. Thanks.

James Page (james-page)
Changed in charm-nova-cloud-controller:
assignee: Sahid Orentino (sahid-ferdjaoui) → David Ames (thedac)
status: Incomplete → New
Chris Gregan (cgregan)
tags: removed: cdo-release-blocker
Revision history for this message
Ryan Beisner (1chb1n) wrote :

On the topic of mutable bindings, there is currently no solution short of further development in Juju core. For this condition, which can affect clouds where default bindings (or any bindings) are in play, the work-around is now documented in the charm-guide release notes for the 19.04 OpenStack Charms:

https://docs.openstack.org/charm-guide/latest/1904.html

tl;dr: Add a second (new) memcached application to the model and define the bindings at deploy time for that new application. See the link above for more detail.
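
For illustration, a rough sketch of what that work-around looks like on the command line (the application alias 'memcached-new' and space name 'internal-space' are placeholders; see the linked release notes for the authoritative steps):

# deploy a new memcached bound to the intended space (names are placeholders)
juju deploy memcached memcached-new --bind "cache=internal-space"
# swap the relation over to the new application
juju remove-relation nova-cloud-controller memcached
juju add-relation nova-cloud-controller:memcache memcached-new:cache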

Changed in charm-nova-cloud-controller:
status: New → Won't Fix