Comment 24 for bug 1341420

Revision history for this message
Yingxin (cyx1231st) wrote :

<please forget about my previous comments>

I've reproduced this bug with *one* scheduler instance based on the latest nova code.

----- reproduce -----

My openstack environment:
There are 3 compute nodes(node2, node3, node4), each of them has limited 1489MB free RAM and plenty of other resources to boot instances. The scheduler is configured "ram_allocation_ration = 1.0" to prevent memory overcommit. And there is only one flavor that requests 400MB RAM, so that at most 3 instances are allowed to boot onto each compute node. 9 instances at most in this 3-node environment.

Then I requested to boot 9 instances named p1, p2, p3, p4, p5, p6, p7, p8, p9 using 9 separate Restful API requests. I sent those requests in 3 groups, each group will send 3 requests in parallel, and those groups are sent in every 0.5 second. It takes a second to finish all those requests.

Those 9 requests are expected to be successful, but here is the full log [2] showing that it is actually not.
I also made some efforts to make those logs look more friendly [3].
As the log said, the request to schedule p8 is failed with 2 retries. Although it finally succeeded to select node3, it is likely to fail if there are more nodes. Imagine there could be up to 1000 nodes with much busier and complex requests.

I modified scheduler logging a little to better understand what is happening inside scheduler: Add log to indicate the beginning and the end of `select_destination()` with the instance name from request_spec; Add log to indicate the start and end time around `objects.ComputeNodeList.get_all()` in `get_all_host_states()`; Most importantly, I added logs to the code[1] to show whether a host state is successfully updated from compute node db object.

----- log analysis -----

Things go well during #L1-L94, until the request 8582 schedules the instance p8 to node4. Note that node4 already has 3 instances at that time. The request will certainly fail because node4 cannot claim another 400MB memory. The request 8582 still makes this wrong decision because the cached host state told that node4 can fit another instance with 689MB memory(#L92). It looks all host state's fault!

So back again to check how host state of node4 went wrong... and here it is. After scheduling two instances to node4, #L66 shows that the host state was updated by database that told scheduler node4 has 1089MB RAM. That is to say, the db contained outdated message but scheduler still accepted it. But according to logic[1], the cache is assumed outdated because the update time of db is newer, why still wrong?

Because there is a time GAP between scheduler makes a decision and the compute node persistents this decision. For example in the log #L66 [2], the first request 132f was persisted to db at 17:19:45, but the second request 8de7 still hadn't been persisted to db by the time 17:19:45.242. However, the node4 state cache was updated by the third request 8582 at 17:19:44.935. So that host state is refreshed by db whose updated time is newer, but it caused the lost of consumption from request 8de7 and bad things happened!

The same bad happened again at the first retry of request 8582(the 10th column of [3]) to choose the wrong node2, because node2 state cache was updated by outdated data from sluggish db at #L113 [2].

----- possible solution -----

Synchronize host state cache with database is very important in making right decisions and supporting multiple scheduler instances. However, there is always a time gap between the scheduler making a consumption and this consumption is persisted to database. The scheduler may lost some recent consumptions after being refreshed from database. What if the scheduler can *remember* those recent consumptions as resource claims and reapply them to the updated host state cache? It can help fill the gap.

----- the extra -----

Current update model of host state cache has some issues. Firstly, a host state cache will be updated even if it is already the same with the latest db data according to code [1]. For example in previous log [2], #L16, L17, L29 are unnecessary updates regardless that they do no harm to making the right decision. Secondly, assume a host state cache is created based on db data1 at time t1, and later, there comes db data2 updated at t2(t2 > t1), but host state has already been consumed at t3(t3 > t2). Althouth data2 is newer than data1, scheduler will still discard the new information and use the older one, because `HostState.updated > compute.updated_at` according to code [1].

Another thing is, I see no race conditions between greenthreads after doing many experiments of sending requests in parallel. For example in log[4] I've triggered 6 successful parallel requests. The results are always correct if the database does not make troubles. I think it is because that there is no `sleep(0)` or any asynchronous request(except for the experimental trusted filter) during filtering, weighing and consuming. So these operations can be treat as atomic as a whole.

[1] https://github.com/openstack/nova/blob/master/nova/scheduler/host_manager.py#L162-L164
[2] http://paste.openstack.org/show/482724/
[3] http://paste.openstack.org/show/482725/
[4] http://paste.openstack.org/show/482728/