Comment 11 for bug 1341420

Robert Collins (lifeless) wrote:

Sure, but this is pathological: we're scheduling incredibly poorly if requests come in at a fast enough rate. Ironic doesn't change *anything* about the basic approach, and the problem isn't unique to Ironic - KVM with machine-sized VMs will behave identically, for instance.

IMO we need to consider this bug a result of design decisions, and we can revisit those after we've learned about the issues they cause...

The issue here is that there is a data structure maintained by the compute process (perhaps via conductor for actual DB writes) but consulted by the scheduler process(es).

If we take a leaf from the BASE design guidelines, a small change to fix things might look something like this, using a read-after-write pattern (which is in some ways heinous...):

add a scheduler grants table (timestamp, host, memory_gb, cpu_count, disk_gb)
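
Concretely, a sketch of that table - SQLAlchemy-ish, with made-up names and types, not Nova's real schema:

# Illustrative only: the proposed grants table. The timestamp is
# assigned by the DB, not the client, so grant ordering is consistent.
from sqlalchemy import (Column, DateTime, Integer, MetaData, String,
                        Table, func)

metadata = MetaData()

scheduler_grants = Table(
    'scheduler_grants', metadata,
    Column('id', Integer, primary_key=True),
    Column('timestamp', DateTime, server_default=func.now(),
           nullable=False),
    Column('host', String(255), nullable=False, index=True),
    Column('memory_gb', Integer, nullable=False),
    Column('cpu_count', Integer, nullable=False),
    Column('disk_gb', Integer, nullable=False),
)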

scheduler receives a request:
 - gets a baseline in-memory view using the current read-all-rows approach
 - updates that by reading outstanding grants (or all and discard-in-memory)
 - schedules
 - writes a grant to the DB (using a DB-set timestamp)
 - reads back from the DB to see whether other grants were made for the same host(s), and whether they would have invalidated ours.
   - if and only if they invalidate it and the timestamp on the invalidating grant is less than ours, fall over to the next viable host and remove our grant (we lost); otherwise we won the conflict.
 - return results (see the sketch just after this list)
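
Roughly, the scheduler side might look like this (again a sketch against the table above; candidate_hosts, request, fits_with and NoValidHost are all made up for illustration, not existing Nova code):

# Illustrative only. candidate_hosts is the filtered/weighed host list
# built from the baseline view plus outstanding grants (steps 1-3).
from sqlalchemy import insert, select

class NoValidHost(Exception):
    """Hypothetical: no host survived the grant-conflict fallover."""

def schedule_with_grant(conn, candidate_hosts, request):
    for host in candidate_hosts:
        # write a grant, letting the DB assign the timestamp
        result = conn.execute(
            insert(scheduler_grants).values(
                host=host.name,
                memory_gb=request.memory_gb,
                cpu_count=request.cpu_count,
                disk_gb=request.disk_gb,
            )
        )
        grant_id = result.inserted_primary_key[0]

        # read-after-write: fetch our DB-assigned timestamp, then look
        # for other grants made against the same host
        our_ts = conn.execute(
            select(scheduler_grants.c.timestamp)
            .where(scheduler_grants.c.id == grant_id)
        ).scalar_one()
        others = conn.execute(
            select(scheduler_grants)
            .where(scheduler_grants.c.host == host.name)
            .where(scheduler_grants.c.id != grant_id)
        ).fetchall()

        # only grants with an earlier timestamp can beat ours
        earlier = [g for g in others if g.timestamp < our_ts]
        if host.fits_with(earlier):  # hypothetical capacity check
            return host  # we won the conflict (or there was none)

        # we lost: remove our grant and fall over to the next host
        conn.execute(
            scheduler_grants.delete()
            .where(scheduler_grants.c.id == grant_id)
        )
    raise NoValidHost()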

compute process receives an instance to build (or migrate):
 - takes its local claim-lock
 - reads the oldest matching grant from the scheduler table (select from ... where host=self.hostname and cpu_count=... order by timestamp asc limit 1)
 - updates its host row in the DB with the now-available resources
 - also sets its last-grant timestamp to the timestamp from that grant (see the sketch just after this list)
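
The compute side, in the same illustrative style (scheduler_grants/metadata are from the sketch above; compute_nodes stands in for the real compute hosts table, and claim_lock/request are made-up parameters):

# Illustrative only: consume the oldest grant matching this request.
from sqlalchemy import Column, DateTime, Integer, String, Table, select

compute_nodes = Table(
    'compute_nodes', metadata,
    Column('hostname', String(255), primary_key=True),
    Column('memory_gb_free', Integer),
    Column('cpu_free', Integer),
    Column('disk_gb_free', Integer),
    Column('last_grant_timestamp', DateTime),
)

def consume_grant(conn, hostname, request, claim_lock):
    with claim_lock:  # the compute process's local claim-lock
        # oldest grant for this host matching the request's resources
        grant = conn.execute(
            select(scheduler_grants)
            .where(scheduler_grants.c.host == hostname)
            .where(scheduler_grants.c.cpu_count == request.cpu_count)
            .where(scheduler_grants.c.memory_gb == request.memory_gb)
            .where(scheduler_grants.c.disk_gb == request.disk_gb)
            .order_by(scheduler_grants.c.timestamp.asc())
            .limit(1)
        ).first()
        if grant is None:
            return  # nothing outstanding for us
        # fold the grant into the host row and record how far through
        # the grant stream this host has consumed
        conn.execute(
            compute_nodes.update()
            .where(compute_nodes.c.hostname == hostname)
            .values(
                memory_gb_free=(compute_nodes.c.memory_gb_free
                                - grant.memory_gb),
                cpu_free=compute_nodes.c.cpu_free - grant.cpu_count,
                disk_gb_free=(compute_nodes.c.disk_gb_free
                              - grant.disk_gb),
                last_grant_timestamp=grant.timestamp,
            )
        )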

And we add a periodic job to the scheduler to delete rows from the grant table where the timestamp is less than the last-grant timestamp in the matching compute row (or where there is no matching compute row).
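
The cleanup might look like (same caveats as above):

# Illustrative periodic cleanup: drop grants the compute side has
# already folded into its host row, plus grants whose host no longer
# has a compute row at all.
from sqlalchemy import delete, select

def purge_consumed_grants(conn):
    matching = select(compute_nodes.c.hostname).where(
        compute_nodes.c.hostname == scheduler_grants.c.host
    )
    consumed_before = select(
        compute_nodes.c.last_grant_timestamp
    ).where(
        compute_nodes.c.hostname == scheduler_grants.c.host
    ).scalar_subquery()
    conn.execute(
        delete(scheduler_grants).where(
            (scheduler_grants.c.timestamp < consumed_before)
            | ~matching.exists()
        )
    )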

This would have the following properties:
 - min two more DB writes per scheduled instance (one to write the grant, one to delete it later)
 - min three more DB reads per scheduled instance (the compute host has to read the grant, the scheduler has to read all the unhandled grants and subtract them before scheduling, and the scheduler has to look for conflicting grants after scheduling, before returning)
 - up to viable-host-count extra writes and reads to handle falling over to the next viable host in the event of a conflict. Assume there are 10 schedulers and someone is flooding the system with identical requests. In each round one scheduler will win, but it will rejoin with the next request and presumably come up with the same next host that all the others have fallen over to. In this scenario scheduling will end up single-threaded (with 9 threads of spinlock-style failures at any point in time), but it won't deadlock or halt.