Comment 27 for bug 1341420

Mark Goddard (mgoddard) wrote:

We have a 15 node test cluster in our lab, which is managed by Nova and Ironic. We have run into this issue when attempting to boot all 15 nodes at once: typically several nodes fail to boot. This happens whether we use Heat or Nova directly.

We have developed a couple of changes that drastically improve the behaviour.

The first change, which I believe fixes a bug, is to ensure appropriate resource limits are set by the Exact*Filter filters, in the same way as the other filters do. For instance, when a host passes the DiskFilter, the filter writes the calculated disk_gb resource limit to the HostState object in host_passes (see https://github.com/openstack/nova/blob/master/nova/scheduler/filters/disk_filter.py#L59). By contrast, the ExactDiskFilter does not set the disk_gb limit on the host state (see https://github.com/openstack/nova/blob/master/nova/scheduler/filters/exact_disk_filter.py#L23). I think the exact filters should set this limit too. The limit, if present, is later checked in the compute service during the resource claim, which allows the claim to verify that the requested resources are actually available, with the synchronisation provided by doing the check on the compute node. The effect is to make invalid claims fail in the compute service, rather than succeeding and causing strange problems with multiple instances trying to provision a single Ironic node.
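
Here is a rough sketch of what I mean for ExactDiskFilter, mirroring the way DiskFilter records its limit when a host passes (attribute names follow DiskFilter and are illustrative rather than an exact patch):

    from nova.scheduler import filters

    class ExactDiskFilter(filters.BaseHostFilter):
        """Exact disk filter that also records a disk_gb limit."""

        def host_passes(self, host_state, filter_properties):
            instance_type = filter_properties.get('instance_type')
            requested_disk_mb = (1024 * (instance_type['root_gb'] +
                                         instance_type['ephemeral_gb']) +
                                 instance_type['swap'])

            if requested_disk_mb != host_state.free_disk_mb:
                return False

            # New: record the limit on the host state so the compute-side
            # claim can enforce it, as DiskFilter does on success.
            host_state.limits['disk_gb'] = host_state.total_usable_disk_gb
            return True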

The second change we have developed recognises that when scheduling onto homogeneous Ironic nodes, a large number of compute hosts will typically pass the scheduler filters, all with equal weight. In the default configuration, with scheduler_host_subset_size = 1, concurrent requests will all schedule onto the same node, the first in the list. As discussed earlier, increasing this value reduces the chance of a collision, but how big should it be? If we set it too high, we undermine the weighting system by giving lower-weight hosts a chance to be scheduled when they otherwise would not be. Our solution is to extend the host subset to include all hosts tied at the top weight. We can then leave scheduler_host_subset_size at a sensible value, while still allowing scheduling onto all of the 'equally best' nodes. The race still exists, but the chance of hitting it is reduced.
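
As a rough, standalone sketch of the selection logic (function and variable names are illustrative, not Nova's actual ones):

    import random

    def choose_host(weighed_hosts, host_subset_size):
        # weighed_hosts is sorted by weight, highest first. Extend the
        # subset so that every host tied at the top weight is a candidate,
        # then pick randomly within it as the scheduler normally would.
        if not weighed_hosts:
            return None
        top_weight = weighed_hosts[0].weight
        num_top = sum(1 for h in weighed_hosts if h.weight == top_weight)
        subset_size = max(host_subset_size, num_top)
        return random.choice(weighed_hosts[:subset_size])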

Any thoughts on these approaches?