can't read availability zones and instances don't start

Bug #1823740 reported by Jason Hobbs
This bug affects 1 person
Affects: OpenStack Nova Cloud Controller Charm
Status: Won't Fix
Importance: Critical
Assigned to: David Ames
Milestone: 19.04

Bug Description

When testing the OpenStack next charms, we found that instances wouldn't start and that we were unable to retrieve the list of availability zones.

ubuntu@production-cpe-e9a6b960-8f67-48b6-9695-ae90b23c0b09:~/project/config$ openstack availability zone list
Unable to establish connection to http://10.244.40.91:8774/v2.1/os-availability-zone/detail: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response',))

Instances stay in BUILD status forever:
http://paste.ubuntu.com/p/6qtnthbRWV/

bundle:
http://paste.ubuntu.com/p/MVcWp6xvVK/

Links to crashdumps from several test runs are available here:
https://solutions.qa.canonical.com/#/qa/bug/1823740

Click on an instance ID to see a test run, then go to the artifacts listing at the bottom of the test run page.

We hit this every time.

affects: cdoqa-system-tests → charm-nova-cloud-controller
tags: added: cdo-qa foundations-engine
summary: - rally times out against openstack charms next
+ can't read availability zones and instances don't start
description: updated
description: updated
tags: added: cdo-release-blocker
description: updated
description: updated
description: updated
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Is it possible that this is bug 1822541?

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

http://people.canonical.com/~jhobbs/juju-crashdump-openstack-2019-04-08-21.55.12.tar.gz

nova-cloud-controller_0/var/log/nova/nova-conductor.log from this crashdump shows the same error as bug 1822541:

http://paste.ubuntu.com/p/hDVvX3PhPN/

and we have memcached related to n-c-c, so I think it is the same issue.

Revision history for this message
David Ames (thedac) wrote :

Confirmed, this appears to be a duplicate of bug 1822541.

Revision history for this message
Ryan Beisner (1chb1n) wrote :

Marking this bug as a duplicate of the actual upstream bug.

The issue, and the fix, are not tied to charm revisions or releases in any way.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

It's kind of tied to charms in that the new charms force us to relate to memcache, which causes this to happen.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

This must not be a dup, or at least that's not the full story, because we're still seeing the symptoms of this bug with the fixed oslo.cache package, even though we no longer see the traceback.

Here is an updated crashdump:
http://people.canonical.com/~jhobbs/juju-crashdump-openstack-2019-04-11-00.51.42.tar.gz

Revision history for this message
Ryan Beisner (1chb1n) wrote :

We are having trouble locating fresh "next" runs in solutions.qa.c.c related to ^ this bug. Can you please update the bug with distinct links to the run with the proposed packages?

Changed in charm-nova-cloud-controller:
importance: Undecided → Critical
assignee: nobody → Sahid Orentino (sahid-ferdjaoui)
milestone: none → 19.04
status: New → Incomplete
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :
Changed in charm-nova-cloud-controller:
status: Incomplete → New
Revision history for this message
James Page (james-page) wrote :

I'm still struggling to reproduce this; I verified that I saw the original issue (instances stuck in BUILD) with rocky, and then upgraded to rocky-proposed - instances are building fine and the call above is working:

$ openstack availability zone list
+-----------+-------------+
| Zone Name | Zone Status |
+-----------+-------------+
| internal | available |
| nova | available |
| nova | available |
| nova | available |
| nova | available |
+-----------+-------------+

HA nova-cloud-controller deployment with 3 units and a relation to memcached

Revision history for this message
James Page (james-page) wrote :

@jhobbs

I'm struggling to actually retrieve the test artefacts for that most recent run -

https://oil-jenkins.canonical.com/artifacts/133de1bc-7cdf-40c4-9a8e-7948986ec8c3/index.html

is returning a not found.

Changed in charm-nova-cloud-controller:
status: New → Incomplete
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

@james-page there was a hiccup with swift - they are there now, thanks.

Changed in charm-nova-cloud-controller:
status: Incomplete → New
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

This is the latest reproduction now:

https://solutions.qa.canonical.com/#/qa/testRun/4b4e2f84-b3b3-4b5d-a070-005125f12abd

rally failed, timing out after 5 minutes waiting for an instance to change from BUILD to ACTIVE.

I can get an instance to start if I go through horizon.

However, I can't list availability zones through either horizon or the CLI:

ubuntu@production-cpe-4b4e2f84-b3b3-4b5d-a070-005125f12abd:~/project$ openstack availability zone list
^^^ Disconnects

Revision history for this message
David Ames (thedac) wrote :

TRIAGE:

Confirmed rocky-proposed in use.
Confirmed python3-oslo.cache 1.30.1-0ubuntu1.1~cloud0

The bug is in the relation between nova-cloud-controller and memcached. It is using the wrong space IP address.

Steps to recreate:

When running:
openstack availability zone list
Unable to establish connection to http://10.244.40.91:8774/v2.1/os-availability-zone/detail: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response',))

On nova-cloud-controller/0
root@juju-ab7f6f-18-lxd-5:/var/log/nova# netstat -tn|grep SYN
tcp 0 1 192.168.33.172:39964 192.168.33.163:11211 SYN_SENT

Manual attempt also times out:
root@juju-ab7f6f-18-lxd-5:/var/log/nova# nc -vz 192.168.33.163 11211

ifconfig for nova-cloud-controller/0
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
        inet 10.244.41.50 netmask 255.255.248.0 broadcast 10.244.47.255
        inet6 fe80::216:3eff:fe8c:81e2 prefixlen 64 scopeid 0x20<link>
        ether 00:16:3e:8c:81:e2 txqueuelen 1000 (Ethernet)
        RX packets 254198 bytes 295532290 (295.5 MB)
        RX errors 0 dropped 1 overruns 0 frame 0
        TX packets 284144 bytes 109114470 (109.1 MB)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

eth1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 9000
        inet 192.168.33.172 netmask 255.255.255.0 broadcast 192.168.33.255
        inet6 fe80::216:3eff:fe8a:e400 prefixlen 64 scopeid 0x20<link>
        ether 00:16:3e:8a:e4:00 txqueuelen 1000 (Ethernet)
        RX packets 369976 bytes 139660554 (139.6 MB)
        RX errors 0 dropped 1 overruns 0 frame 0
        TX packets 438489 bytes 84210710 (84.2 MB)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
        inet 127.0.0.1 netmask 255.0.0.0
        inet6 ::1 prefixlen 128 scopeid 0x10<host>
        loop txqueuelen 1000 (Local Loopback)
        RX packets 151594 bytes 44616315 (44.6 MB)
        RX errors 0 dropped 0 overruns 0 frame 0
        TX packets 151594 bytes 44616315 (44.6 MB)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

memcached/1 iptables rules
Chain ufw-user-input (1 references)
target prot opt source destination
ACCEPT tcp -- 10.244.41.55 0.0.0.0/0 tcp dpt:11211
ACCEPT tcp -- 10.244.41.52 0.0.0.0/0 tcp dpt:11211
ACCEPT tcp -- 10.244.41.50 0.0.0.0/0 tcp dpt:11211
ACCEPT tcp -- 192.168.33.180 0.0.0.0/0 tcp dpt:11211
ACCEPT tcp -- 192.168.33.173 0.0.0.0/0 tcp dpt:11211
ACCEPT tcp -- 192.168.33.164 0.0.0.0/0 tcp dpt:11211
ACCEPT tcp -- 192.168.33.167 0.0.0.0/0 tcp dpt:11211
ACCEPT tcp -- 192.168.33.156 0.0.0.0/0 tcp dpt:11211
ACCEPT tcp -- 192.168.33.147 0.0.0.0/0 tcp dpt:11211
ACCEPT tcp -- 192.168.33.151 0.0.0.0/0 tcp dpt:11211
ACCEPT tcp -- 0.0.0.0/0 0.0.0.0/0 tcp dpt:22
DROP tcp -- 0.0.0.0/0 0.0.0.0/0 tcp...


Revision history for this message
David Ames (thedac) wrote :

It appears both charms are doing the right thing by calling get_relation_ip() with the relation name:

https://github.com/openstack/charm-nova-cloud-controller/blob/bd3d84cfcd2b7696fd5299ac2c906e491d2a73e6/hooks/nova_cc_hooks.py#L970
https://git.launchpad.net/memcached-charm/tree/hooks/memcached_hooks.py#n161
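
For illustration, a rough way to check which address each endpoint resolves (a sketch, assuming the endpoint names memcache and cache used by these charms; network-get is the hook tool that space-aware helpers like get_relation_ip typically consult):

# address nova-cloud-controller's 'memcache' endpoint resolves to
juju run --unit nova-cloud-controller/0 'network-get memcache'
# address memcached's 'cache' endpoint resolves to
juju run --unit memcached/1 'network-get cache'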

This suggests the bundle will require:

nova-cloud-controller:
  bindings:
    memcache: $CORRECT_SPACE

memcached:
  bindings:
    cache: $CORRECT_SPACE

Please test this.
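
One way to double-check what address nova-cloud-controller actually published to memcached over the relation (a sketch; the relation id cache:NN and the 'private-address' key are the usual ones but may differ in a given deployment):

# list the relation id for memcached's 'cache' endpoint
juju run --unit memcached/1 'relation-ids cache'
# substitute the relation id returned above for cache:NN
juju run --unit memcached/1 'relation-get -r cache:NN private-address nova-cloud-controller/0'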

Changed in charm-nova-cloud-controller:
status: New → Incomplete
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Testing confirms the binding was the issue, and that correcting it fixes all the problems. Thanks.

James Page (james-page)
Changed in charm-nova-cloud-controller:
assignee: Sahid Orentino (sahid-ferdjaoui) → David Ames (thedac)
status: Incomplete → New
Chris Gregan (cgregan)
tags: removed: cdo-release-blocker
Revision history for this message
Ryan Beisner (1chb1n) wrote :

On the topic of mutable bindings, there is currently no solution short of further development in Juju core. For this condition, which can affect clouds where default bindings (or any bindings) are in play, the work-around is now documented in the charm-guide release notes for the 19.04 OpenStack Charms:

https://docs.openstack.org/charm-guide/latest/1904.html

tl;dr: Add a second (new) memcached application to the model and define the bindings at deploy time for that new application. See the link above for more detail.
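
For illustration, a rough sketch of what that work-around looks like on the command line (the application alias 'memcached-new' and space name 'internal-space' are placeholders; see the linked release notes for the authoritative steps):

# deploy a new memcached bound to the intended space (names are placeholders)
juju deploy memcached memcached-new --bind "cache=internal-space"
# swap the relation over to the new application
juju remove-relation nova-cloud-controller memcached
juju add-relation nova-cloud-controller:memcache memcached-new:cache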

Changed in charm-nova-cloud-controller:
status: New → Won't Fix