Comment 22 for bug 1341420

Revision history for this message
Yingxin (cyx1231st) wrote :

Did anyone notice the problem that `select_destination()` doesn't reuse the host state cache to make decisions at all?

That is to say:
The scheduler discards its host state and refreshes it from the DB at the beginning of every `select_destination()` call, instead of reusing the recently updated host state cache.

### explanation ###

Robert Collins pointed out in the bug description that the scheduler works well in situation [1], booting 45 instances with a single command. But when he booted those 45 instances with 45 concurrent commands in situation [2], up to 50% of them failed, which is unacceptable.

The real difference between [1] and [2] is that:
- In situation [1], the scheduler reuses the host state cache across the `for` loop [3], so the following 44 scheduling decisions are made on an INCREMENTALLY UPDATED host state cache. The result is therefore accurate.
- In situation [2], however, each of the 45 requests refreshes the host state via `get_all_host_states()` [4] at the beginning, so all 45 concurrent scheduling decisions are made on the SAME DB state. No wonder 50% of them fail due to conflicts. Worse, the failure rate could reach 97.77% (44 of 45 requests) if CONF.scheduler_host_subset_size is 1, in the most extreme case.
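To make the difference between [1] and [2] concrete, here is a toy simulation (not Nova code; the `schedule` function and its parameters are mine for illustration). Each host fits exactly one instance, and `reuse_cache` switches between consuming an incrementally updated cache (situation [1]) and re-reading the same stale DB snapshot for every request (situation [2]):

```python
def schedule(num_requests, num_hosts, reuse_cache):
    """Toy model of situations [1] and [2]. Each host fits exactly one
    instance; a request fails if its chosen host is already full."""
    db_free = {h: 1 for h in range(num_hosts)}  # authoritative DB state
    cache = dict(db_free)                       # scheduler's host state cache
    # In [2] all 45 concurrent requests refresh before any placement commits,
    # so they all see this one snapshot:
    snapshot = dict(db_free)
    failures = 0
    for _ in range(num_requests):
        view = cache if reuse_cache else snapshot
        host = max(view, key=view.get)          # pick the "best" host (subset_size=1)
        if db_free[host] > 0:
            db_free[host] -= 1                  # instance actually lands
            if reuse_cache:
                cache[host] -= 1                # consume the cached host state
        else:
            failures += 1                       # conflict: host already taken
    return failures

print(schedule(45, 45, reuse_cache=True))   # 0  -- situation [1]
print(schedule(45, 45, reuse_cache=False))  # 44 -- situation [2]: 44/45 = 97.77%
```

With the stale snapshot, every request weighs the hosts identically, picks the same winner, and only the first one succeeds, which is exactly the 44/45 = 97.77% worst case above.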

Another thing to point out:
Currently, there is no `sleep(0)` or any asynchronous request (except for the experimental trusted filter) during filtering and weighing. So the scheduling operations 'refresh host state from DB data', 'filtering', 'weighing', 'consume host states', and 'return decision' can be treated as one atomic operation as a whole. This strengthens my opinion that in situation [2], the scheduler uses the 'almost' same host state cache across the 45 concurrent requests, so those decisions will definitely conflict.
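The atomicity argument can be modeled with plain generators, since an eventlet greenthread, like a generator, only gives up control at an explicit yield point. The names below are illustrative, not Nova's; the point is that because there is no yield before the decision is recorded, interleaving the "greenthreads" cannot help: each one still decides on the same snapshot.

```python
def select_destination(request_id, db, decisions):
    # No yield point anywhere below, so under cooperative scheduling
    # this whole block runs atomically, just like the real code path.
    state = dict(db)                  # refresh host state from DB
    host = max(state, key=state.get)  # filtering + weighing
    decisions.append((request_id, host))  # consume state / return decision
    yield                             # control is only released HERE

db = {'host-a': 2, 'host-b': 1}       # same DB snapshot for everyone
decisions = []
greenthreads = [select_destination(i, db, decisions) for i in range(3)]
for g in greenthreads:                # round-robin "event loop"
    next(g)
print(decisions)  # all three picked 'host-a' -> guaranteed conflict
```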

[1] 15:46 < lifeless> nova boot <...> --num-instances 45 -> works fairly reliably. Some minor timeout related things to fix but nothing dramatic.
[2] 15:47 < lifeless> heat create-stack <...> with a stack with 45 instances in it -> about 50% of instances fail to come up
[3] https://github.com/openstack/nova/blob/master/nova/scheduler/filter_scheduler.py#L117-L149
[4] https://github.com/openstack/nova/blob/master/nova/scheduler/host_manager.py#L160-L222

### solution ###

The reasonable solution is to refresh a host state only if the fetched DB data is newer than the data that host state is based on.

The current timestamp in the compute_node table is not precise enough, because it only records to the second. It should be at least millisecond precision, or better an update counter, to determine whether a host state is outdated. This might deserve a blueprint to implement.
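A minimal sketch of what I mean, assuming a per-row update counter (the `version` field and the `HostState` shape here are hypothetical, not the real `nova.scheduler.host_manager.HostState`): the refresh becomes a no-op when the fetched row is not newer than what the cache was built from, so locally consumed claims survive stale re-reads.

```python
class HostState:
    """Hypothetical host state that only accepts genuinely newer DB data."""

    def __init__(self):
        self.version = -1      # update counter of the DB row we consumed
        self.free_ram_mb = 0

    def update_from_compute_node(self, compute):
        if compute['version'] <= self.version:
            return             # cache is newer (it holds consumed claims); keep it
        self.version = compute['version']
        self.free_ram_mb = compute['free_ram_mb']

    def consume_from_request(self, ram_mb):
        self.free_ram_mb -= ram_mb   # incremental local claim

state = HostState()
state.update_from_compute_node({'version': 7, 'free_ram_mb': 4096})
state.consume_from_request(1024)                                 # claim -> 3072
state.update_from_compute_node({'version': 7, 'free_ram_mb': 4096})
assert state.free_ram_mb == 3072   # stale refresh ignored; claim preserved
state.update_from_compute_node({'version': 8, 'free_ram_mb': 2048})
assert state.free_ram_mb == 2048   # genuinely newer DB data wins
```

A second-granularity `updated_at` cannot play the role of `version` here, because two compute-node updates within the same second would compare as "not newer" and be wrongly discarded.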