Comment 3 for bug 1838411

Daniel Badea (daniel.badea) wrote :

The test procedure was:
1. run openstack service list in a loop
2. run system host-swact $(hostname)
3. monitor: kubernetes pods, openstack ingress docker logs, DNS availability,
   sm-customer.log
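
For reference, the probe used in steps 1 and 3 was roughly the loop below. This
is an illustrative reconstruction, not the exact script; the log file name,
interval and output format are assumptions. Note that the timestamp is recorded
when the command is issued, before it returns:

   # hypothetical reconstruction of the rest api / dns probe
   while true; do
      ts=$(date '+%H:%M:%S.%N')        # captured when the command is issued
      ip=$(dig +short +time=1 +tries=1 keystone.openstack.svc.cluster.local | head -1)
      if openstack service list >/dev/null 2>&1; then api=OK; else api=FAIL; fi
      echo "$ts dns=${ip:-none} api=$api" >> probe.log
      sleep 1
   done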

What I found so far:
1. a controlled swact causes a 20-25s disruption in openstack rest api
   availability. Here's an example (with comments):
   +00.000s controller-1 status: active go-standby
   +00.653s keystone.openstack.svc.cluster.local status: no DNS (rest api
            unavailable: host name does not resolve to an ip address)
   +02.523s controller-1 status: go-standby disabling
   +14.108s controller-1 status: disabling disabled
   +14.416s controller-0 status: standby go-active
   +17.625s controller-1 status: disabled standby
   +19.134s controller-1 status: standby standby-degraded
   +21.407s controller-1 status: standby-degraded standby
   +21.810s I0806 38 leaderelection.go:205] attempting to acquire leader lease
            openstack/osh-openstack-ingress-nginx...
            (ingress pod fails leader election)
   +21.810s I0806 38 leaderelection.go:249] failed to renew lease
            openstack/osh-openstack-ingress-nginx: failed to tryAcquireOrRenew
            context deadline exceeded
            (ingress pod fails leader election)
   +21.862s keystone.openstack.svc.cluster.local status: 10.100.157.117
            (dns available)
   +21.872s 10.100.157.117 http status: UP
   +21.880s openstack server list: OK (rest api available)
   +22.127s 192.168.206.3 - [192.168.206.3] "GET /v3 HTTP/1.1" 200 222 "-"
            "openstacksdk/0.25.0 keystoneauth1/0.0.0 python-requests/2.21.0
            CPython/2.7.5" 236 0.003 [openstack-keystone-api-ks-pub]
            172.16.193.49:5000 271 0.002 200 d8d497e2622420863667621c52607d23
            (ingress forwards rest api request to keystone)
Note:
  Time stamps for the rest api and dns checks are captured when the command is
  issued, not when it returns. This explains why "openstack server list"
  appears to succeed before the request is handled by the ingress container.

2. kubernetes is not running in an HA configuration:
   a. there is only one etcd process; its data directory is synchronized
      between controllers via drbd
   b. the etcd process is stopped on the source controller and then started on
      the target controller after swact
   c. kube-controller-manager is restarted on the controller that becomes active
   d. kube-scheduler is restarted on the controller that becomes standby

3. openstack ingress containers are not restarted, but they use leader
   election and are impacted by the kubernetes cluster glitch caused by the
   swact (a few commands to check points 2 and 3 are sketched below)
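
A few commands that could be used to confirm points 2 and 3 (illustrative; the
process and pod names are what I expect to find, adjust as needed):

   # etcd should run only on the active controller, with its data dir on drbd
   ps -C etcd -o pid,lstart,cmd            # run on both controllers around the swact
   cat /proc/drbd                          # drbd resource state
   # restarts of the control plane show up as new process start times
   ps -C kube-controller-manager -o pid,lstart,cmd
   ps -C kube-scheduler -o pid,lstart,cmd
   # follow leader election messages from an ingress pod (pod name is an example)
   kubectl -n openstack get pods | grep ingress
   kubectl -n openstack logs -f <ingress-pod> | grep leaderelection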

My guess at what happens to an openstack rest api request during a controlled
swact:
1. dns is unavailable while the resolver floating ip moves between controllers
2. calico networking is already set up on the controller that becomes active
   (no delay caused by container/pod/cluster networking setup)
3. the rest api request reaches the openstack ingress container but is not
   serviced because the pod detects a leader election issue
4. when kube-controller-manager is back, the ingress resumes forwarding
   requests to openstack keystone and the rest api request succeeds
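
If this guess is right, the ~20-25s outage has two separable contributors: the
dns gap (step 1) and the ingress leader election gap (steps 3-4). A rough way
to time them independently during a swact, each loop in its own shell (the
service ip, port and path below are taken from the trace above and may need
adjusting):

   # dns gap: when does keystone.openstack.svc.cluster.local stop/start resolving
   while true; do
      echo "$(date '+%T.%N') $(dig +short +time=1 +tries=1 keystone.openstack.svc.cluster.local | head -1)"
      sleep 0.5
   done

   # ingress gap: bypass dns and call the resolved service ip directly
   while true; do
      echo "$(date '+%T.%N') $(curl -s -o /dev/null -m 1 -w '%{http_code}' http://10.100.157.117/v3)"
      sleep 0.5
   done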