The test procedure was:
1. run openstack service list in a loop
2. run system host-swact $(hostname)
3. monitor: kubernetes pods, openstack ingress docker logs, DNS availability,
sm-customer.log
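
For reference, a minimal sketch of the probe loop from steps 1 and 3, assuming Python 3 is available where the probe runs. This is not the script that was actually used: the helper names (dns_check, api_check, run_probe) and the 0.5s interval are hypothetical. Each sample is stamped when the check is issued, not when it returns.

```python
import socket
import subprocess
import time


def dns_check(name="keystone.openstack.svc.cluster.local"):
    """Return the resolved IP address, or None if the name does not resolve."""
    try:
        return socket.gethostbyname(name)
    except OSError:  # socket.gaierror is a subclass of OSError
        return None


def api_check(cmd=("openstack", "service", "list")):
    """Return True if the CLI command exits with status 0."""
    try:
        result = subprocess.run(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
    except OSError:  # command not found or not executable
        return False
    return result.returncode == 0


def run_probe(checks, samples, interval=0.5,
              clock=time.monotonic, sleep=time.sleep):
    """Run every check `samples` times.

    Returns a list of (offset_seconds, name, result) tuples. The offset is
    read before the check is issued, so a slow check does not shift its
    own time stamp.
    """
    start = clock()
    log = []
    for _ in range(samples):
        for name, check in checks.items():
            issued = clock() - start
            log.append((issued, name, check()))
        sleep(interval)
    return log
```

It would be started in one shell with run_probe({"dns": dns_check, "api": api_check}, samples=120) while the swact is triggered from another.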
What I found so far:
1. controlled swact causes a 20-25s disruption in openstack rest api
availability. Here's an example (with comments):
+00.000s controller-1 status: active go-standby
+00.653s keystone.openstack.svc.cluster.local status: no DNS (rest api unavailable: host name does not resolve to an ip address)
+02.523s controller-1 status: go-standby disabling
+14.108s controller-1 status: disabling disabled
+14.416s controller-0 status: standby go-active
+17.625s controller-1 status: disabled standby
+19.134s controller-1 status: standby standby-degraded
+21.407s controller-1 status: standby-degraded standby
+21.810s I0806 38 leaderelection.go:205] attempting to acquire leader lease openstack/osh-openstack-ingress-nginx... (ingress pod fails leader election)
+21.810s I0806 38 leaderelection.go:249] failed to renew lease openstack/osh-openstack-ingress-nginx: failed to tryAcquireOrRenew context deadline exceeded (ingress pod fails leader election)
+21.862s keystone.openstack.svc.cluster.local status: 10.100.157.117 (dns available)
+21.872s 10.100.157.117 http status: UP
+21.880s openstack server list: OK (rest api available)
+22.127s 192.168.206.3 - [192.168.206.3] "GET /v3 HTTP/1.1" 200 222 "-" "openstacksdk/0.25.0 keystoneauth1/0.0.0 python-requests/2.21.0 CPython/2.7.5" 236 0.003 [openstack-keystone-api-ks-pub] 172.16.193.49:5000 271 0.002 200 d8d497e2622420863667621c52607d23 (ingress forwards rest api request to keystone)
Note:
Time stamps for the rest api and dns checks are captured when the command is
issued, not when it returns. This is why "openstack server list" appears to
be OK before the request is handled by the ingress container.
2. kubernetes is not running in an HA configuration:
a. there is only one etcd process. Its data directory is drbd synchronized
between controllers
b. etcd process is stopped on the source controller and then started on
the target controller after swact
c. kube-controller-manager is restarted on the controller that becomes active
d. kube-scheduler is restarted on the controller that becomes standby
3. openstack ingress containers are not restarted, but they use leader
election and are impacted by the kubernetes cluster glitch caused by swact
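
To put a number on the disruption in finding 1, here is a small sketch that reads the "+SS.mmms" offsets from a few lines of the timeline above (comments trimmed) and subtracts the first DNS failure from the first successful request. The function names are hypothetical, and this is only one way to define the window.

```python
import re

# Excerpt of the timeline from finding 1 (comments trimmed).
TIMELINE = """\
+00.000s controller-1 status: active go-standby
+00.653s keystone.openstack.svc.cluster.local status: no DNS
+14.416s controller-0 status: standby go-active
+21.862s keystone.openstack.svc.cluster.local status: 10.100.157.117
+21.880s openstack server list: OK
"""

LINE = re.compile(r"^\+(\d+\.\d+)s\s+(.*)$")


def parse(text):
    """Return (offset_seconds, message) for every line with a +offset prefix."""
    events = []
    for line in text.splitlines():
        match = LINE.match(line)
        if match:
            events.append((float(match.group(1)), match.group(2)))
    return events


def first_offset(events, needle):
    """Return the offset of the first event whose message contains `needle`."""
    for offset, message in events:
        if needle in message:
            return offset
    return None


events = parse(TIMELINE)
dns_lost = first_offset(events, "no DNS")
api_back = first_offset(events, "server list: OK")
outage = api_back - dns_lost
print(f"rest api unavailable for about {outage:.3f}s")
```

For this sample the window is 21.880 - 0.653 = 21.227s, which is in line with the 20-25s range. Since the time stamps are taken when each command is issued, the real recovery point is slightly later than the last offset.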
My guess for what happens in a controlled swact with an openstack
rest api request:
1. dns is unavailable while resolver floating ip moves between controllers
2. calico networking is already set up on the controller that becomes active
(no delay caused by container/pod/cluster networking setup)
3. rest api request reaches the openstack ingress container but it's not
serviced because the pod detects a leader election issue
4. when kube-controller-manager is back, ingress resumes forwarding
requests to openstack keystone and the rest api request is successful