scheduler: build failure high negative weighting

Bug #1818239 reported by James Page
This bug affects 3 people
Affects                                Status        Importance  Assigned to  Milestone
OpenStack Compute (nova)               Incomplete    Undecided   Unassigned
OpenStack Nova Cloud Controller Charm  Fix Released  High        James Page
OpenStack Security Advisory            Won't Fix     Undecided   Unassigned
nova (Ubuntu)                          Triaged       High        Unassigned

Bug Description

Whilst debugging a Queens cloud which seemed to be landing all new instances on 3 out of 9 hypervisors (resulting in three very heavily overloaded servers), I noticed that the weighting applied by the build failure weigher is -1000000.0 * number of failures:

  https://github.com/openstack/nova/blob/master/nova/conf/scheduler.py#L495

This means that a server which has any sort of build failure instantly drops to the bottom of the weighed list of hypervisors when instances are scheduled.
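
For reference, the weigher follows nova's standard host-weigher pattern; a minimal sketch of its shape (illustrative only — the hook signatures vary slightly between releases and are not copied from any specific one):

    # Sketch of how the BuildFailureWeigher behaves, not the upstream code.
    import nova.conf
    from nova.scheduler import weights

    CONF = nova.conf.CONF

    class BuildFailureWeigher(weights.BaseHostWeigher):
        def weight_multiplier(self, host_state):
            # Older releases take no host_state argument here. Negated so
            # that more failures means a more negative weight; defaults to
            # -1.0 * 1000000.0.
            return -1.0 * CONF.filter_scheduler.build_failure_weight_multiplier

        def _weigh_object(self, host_state, weight_properties):
            # Raw weight is the count of recent failed builds on the host;
            # it drops back to zero after a successful build there.
            return host_state.failed_builds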

Why might an instance fail to build? It could be a timeout due to load, or it might be due to a bad image (one that won't actually boot under qemu). The second cause could be triggered by an end user of the cloud, inadvertently causing all instances to be pushed to a small subset of hypervisors (which is what I think happened in our case).

This feels like quite a dangerous default, given the potential to DoS hypervisors, intentionally or otherwise.
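
One mitigation (assuming you would rather tolerate build failures than deprioritise the affected hosts) is to tune the multiplier down, or set it to zero to disable the weigher entirely, in nova.conf on the scheduler nodes:

    [filter_scheduler]
    # Default is 1000000.0. Setting 0.0 disables the build failure weigher;
    # a small value (e.g. 2.0) merely deprioritises hosts with recent build
    # failures instead of effectively excluding them.
    build_failure_weight_multiplier = 0.0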

ProblemType: Bug
DistroRelease: Ubuntu 18.04
Package: nova-scheduler 2:17.0.7-0ubuntu1
ProcVersionSignature: Ubuntu 4.15.0-43.46-generic 4.15.18
Uname: Linux 4.15.0-43-generic x86_64
ApportVersion: 2.20.9-0ubuntu7.5
Architecture: amd64
Date: Fri Mar 1 13:57:39 2019
NovaConf: Error: [Errno 13] Permission denied: '/etc/nova/nova.conf'
PackageArchitecture: all
ProcEnviron:
 TERM=screen-256color
 PATH=(custom, no user)
 XDG_RUNTIME_DIR=<set>
 LANG=C.UTF-8
 SHELL=/bin/bash
SourcePackage: nova
UpgradeStatus: No upgrade log present (probably fresh install)

Revision history for this message
James Page (james-page) wrote :

As a side note, it's really hard to see the calculated weights for each host in the scheduler, as the weighting is stripped before the debug log message is emitted here:

  https://github.com/openstack/nova/blob/master/nova/scheduler/filter_scheduler.py#L460

I figured out what was happening by logging the list of WeighedHost objects rather than the encapsulated obj attributes.
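
For anyone else chasing this, a hypothetical debugging fragment along those lines, placed where the scheduler sorts hosts (the surrounding variable names in filter_scheduler.py are assumptions and may differ by release):

    # Hypothetical debug logging: log the WeighedHost wrappers, which still
    # carry .weight, before they are stripped down to bare HostState objects.
    weighed_hosts = self.host_manager.get_weighed_hosts(hosts, spec_obj)
    LOG.debug("Weighed hosts with weights: %s",
              [(wh.obj.host, wh.weight) for wh in weighed_hosts])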

Revision history for this message
James Page (james-page) wrote :

For further context, the only way I could find to reset the build_failures value for a hypervisor was to restart its nova-compute daemon.

Revision history for this message
James Page (james-page) wrote :

related bug 1742102

Revision history for this message
Corey Bryant (corey.bryant) wrote :

Based on the fix for bug 1742102, a successful build on the host should reset the failure count to zero, but it seems impractical to get a successful build there when the host has such a low weighting. So the code does still seem to leave the ability for a user to intentionally increase the failure count and lower the weight of a compute node (or nodes), limiting the number of nodes an instance can be scheduled to.

James Page (james-page)
Changed in nova (Ubuntu):
status: New → Won't Fix
information type: Private Security → Public Security
Changed in charm-nova-cloud-controller:
status: New → In Progress
importance: Undecided → High
assignee: nobody → James Page (james-page)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-nova-cloud-controller (master)

Fix proposed to branch: master
Review: https://review.openstack.org/640698

tags: added: sts
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-nova-cloud-controller (master)

Reviewed: https://review.openstack.org/640698
Committed: https://git.openstack.org/cgit/openstack/charm-nova-cloud-controller/commit/?id=c5029e9831ab5063485877213987d6827c4d86f1
Submitter: Zuul
Branch: master

commit c5029e9831ab5063485877213987d6827c4d86f1
Author: James Page <email address hidden>
Date: Mon Mar 4 09:25:46 2019 +0000

    Disable BuildFailureWeigher

    Disable the BuildFailureWeigher used when weighting hosts during
    instance scheduling. A single build failure will result in a
    -1000000 weighting which effectively excludes the hypervisor
    from the scheduling decision.

    A bad image can result in build failures resulting in a heavy
    load on hypervisors which have not had a build failure with
    those that have effectively being ignored; the build failure
    count will be reset on a successful build but due to the high
    weighting this won't happen until all resources on known good
    hypervisors have been completely consumed.

    Change-Id: I4d4367ef20e2a20aee1e26d4a0ec69cad2ac69d6
    Closes-Bug: 1818239

Changed in charm-nova-cloud-controller:
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-nova-cloud-controller (stable/18.11)

Fix proposed to branch: stable/18.11
Review: https://review.openstack.org/640961

Revision history for this message
Corey Bryant (corey.bryant) wrote :

Opening this back up against the package and adding upstream as well. I may be missing something, but I think this is still an issue upstream.

Changed in nova (Ubuntu):
status: Won't Fix → Triaged
importance: Undecided → High
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-nova-cloud-controller (stable/18.11)

Reviewed: https://review.openstack.org/640961
Committed: https://git.openstack.org/cgit/openstack/charm-nova-cloud-controller/commit/?id=56de84a81be00bf9ddffca4426d28c17d5d9798e
Submitter: Zuul
Branch: stable/18.11

commit 56de84a81be00bf9ddffca4426d28c17d5d9798e
Author: James Page <email address hidden>
Date: Mon Mar 4 09:25:46 2019 +0000

    Disable BuildFailureWeigher

    Disable the BuildFailureWeigher used when weighting hosts during
    instance scheduling. A single build failure will result in a
    -1000000 weighting which effectively excludes the hypervisor
    from the scheduling decision.

    A bad image can result in build failures resulting in a heavy
    load on hypervisors which have not had a build failure with
    those that have effectively being ignored; the build failure
    count will be reset on a successful build but due to the high
    weighting this won't happen until all resources on known good
    hypervisors have been completely consumed.

    Change-Id: I4d4367ef20e2a20aee1e26d4a0ec69cad2ac69d6
    Closes-Bug: 1818239
    (cherry picked from commit c5029e9831ab5063485877213987d6827c4d86f1)

Revision history for this message
Jeremy Stanley (fungi) wrote :

Is the denial of service concern that an authenticated user could engineer a build failure (perhaps by attempting to boot an intentionally corrupted image they uploaded) and perform that action repeatedly, causing the environment to no longer be able to schedule instances to any of the hypervisor hosts through which their build failures rotated?

Revision history for this message
Corey Bryant (corey.bryant) wrote :

@Jeremy, I think it's more of a limited denial of service (if we can call it that), where a certain number of computes could get a negative weight and not be considered for scheduling. I don't think it's a complete denial of service. For example, in the case you've mentioned, the failure weight would become roughly equivalent for all nodes if they all have a negative weight.

Revision history for this message
Jeremy Stanley (fungi) wrote :

Thanks! I'm mostly looking for an exploit scenario whereby a malicious actor can intentionally cause harm/deny access to the operating environment for other users. Absent this, we'd probably not bother to issue a security advisory about it.

Revision history for this message
Felipe Reyes (freyes) wrote : Re: [Bug 1818239] Re: scheduler: build failure high negative weighting

On Tue, 2019-03-05 at 18:30 +0000, Corey Bryant wrote:
> @Jeremy, I think it's more of a limited denial of service (if we can
> call it that), where a certain number of computes could get a negative
> weight and not be considered for scheduling. I don't think it's a
> complete denial of service.

I believe the term you are looking for is "degradation of service" -
https://en.wikipedia.org/wiki/Denial-of-service_attack#Degradation-of-service_attacks

Revision history for this message
Matt Riedemann (mriedem) wrote :

@James: per comment 2, see bug 1816360 :) Easy fix for that.

Revision history for this message
Matt Riedemann (mriedem) wrote :

I've marked this as Incomplete for nova since I'm not aware of any changes being requested here. The build failure weigher was added because of bug 1742102, and in response to operator feedback from the Boston summit that computes should be auto-disabled if they experienced a build failure. The auto-disable behaviour went in (I think in Pike), and then bug 1742102 described how that was too heavyweight, because there were easy ways to auto-disable a lot of computes in a deployment (e.g. a volume over-quota from multiple concurrent boot-from-volume server create requests). So Dan added the weigher, which can be configured to weigh hosts with build failures lower so they don't become a black hole, and once a host has a successful build the stats tracking for that compute node is reset.

Anyway, it's up to the charms whether they want to disable it by default, but note that it was added for a reason and defaults to being "on" for a reason as well (per operator feedback).
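
The increment/reset behaviour described above can be summarised with a tiny sketch (purely illustrative; in nova the counter is tracked by the compute service and surfaced to the scheduler via compute node stats):

    # Illustrative bookkeeping only, not nova's actual classes.
    class BuildFailureCounter:
        def __init__(self):
            self.failed_builds = 0

        def build_failed(self):
            self.failed_builds += 1   # each failure pushes the host's weight down

        def build_succeeded(self):
            self.failed_builds = 0    # a single success clears the penalty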

Changed in nova:
status: New → Incomplete
Revision history for this message
Chris MacNaughton (chris.macnaughton) wrote :

@fungi - we accidentally took down 9/12 of the hypervisors in our QA cloud with this; 75% isn't quite a complete denial of service but definitely degraded the capacity significantly

Revision history for this message
James Page (james-page) wrote :

@mriedem - yeah that was my hack but I see you beat me to raising a review...

Revision history for this message
Corey Bryant (corey.bryant) wrote :

Matt, what is your opinion on nova disabling the build failure weigher by default? It would then be secure by default, without any exposure to degradation-of-service attacks, and folks could opt in to it if they want. Btw, did you mean to triage this as Won't Fix or Incomplete? I think we have enough details to describe the issue, but if anything else is needed please let us know.

Revision history for this message
Jeremy Stanley (fungi) wrote :

Chris: I don't doubt that this could be a crippling incident, but you say you took down your own cloud and did so accidentally... can you provide a similar scenario where a non-admin user is able to intentionally bring about the same result? That's mostly what I'm looking for to be able to formulate an appropriate impact description for a possible security vulnerability advisory publication.

Revision history for this message
Dan Smith (danms) wrote :

With the weigher, you shouldn't be able to "take down" anything. You may stack a lot more instances on the non-error-reporting hosts, but once those are full, the scheduler will try one of the hosts reporting errors, and as soon as one succeeds there, the score resets to zero. So can you clarify "took down" in this context?

Also, the weight given to this weigher, like all others, is configurable. If you have no desire to deprioritize failing hosts, you can set it to zero, and if you want this to have a smaller impact, you can change the weight to something smaller. The default weight was carefully chosen to cause a failing host to have a lower weight than others, all else being equal. Since the disk weigher scales by free bytes (or whatever), if you're a new compute node that has no instances (and thus a lot of free space) and a bad config that will cause you to fail every boot, the fail weigher has to have an impactful score, or else it really will have no effect.

I've nearly lost the will to even argue about this issue, so I'm not sure what my opinion is on setting the default to zero, other than to say that the converse argument is also true... If you have one compute node with a broken config (or even just something preventing it from talking to neutron), it will attract all builds in the scheduler, fail them, and the cloud is effectively down until a human is paged to remedy the situation. That was the case this was originally trying to mitigate in its original and evolved forms.
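
To make the scale of the default concrete, a back-of-the-envelope example (assuming nova's usual scheme of normalising each weigher's raw values to [0, 1] across the candidate hosts, scaling by the weigher's multiplier, and summing; the numbers are made up):

    # Host A: plenty of free RAM but one recent build failure.
    # Host B: nearly full but with no build failures.
    ram_mult = 1.0            # ram_weight_multiplier default
    fail_mult = -1000000.0    # -1 * build_failure_weight_multiplier default

    host_a = ram_mult * 1.0 + fail_mult * 1.0    # -999999.0
    host_b = ram_mult * 0.05 + fail_mult * 0.0   # 0.05

    # Host B wins by six orders of magnitude, so host A is effectively
    # excluded until every failure-free host is completely full.
    print(host_a, host_b)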

Changed in charm-nova-cloud-controller:
milestone: none → 19.04
status: Fix Committed → Confirmed
status: Confirmed → Fix Released
Revision history for this message
Jeremy Stanley (fungi) wrote :

Since this has come up again in bug 1581977 as representing a security-related concern, I'm adding the security bugtag to it for increased visibility. Note this is not the same as treating it as a security vulnerability, and I don't have the impression that any CVE assignment or security advisory is warranted for this.

information type: Public Security → Public
Changed in ossa:
status: New → Won't Fix
tags: added: security
Revision history for this message
Eric Miller (erickmiller) wrote :

I'm a tad late (5+ years after the last comment), but in case it helps, we just ran into a situation where a user (not a cloud admin) exhausted their IP allocations in a user-defined subnet, which caused an error in nova-compute.log:

NoMoreFixedIps: No fixed IP addresses available for network: 62c7b2e2-4406-4dda-9a01-f0ec623335d4

which caused the BuildFailureWeigher to offset the weight by -1000000.

This was in a Stein cluster, so quite old. A newer version may have this disabled by default, but we had to disable it manually. Regardless, it caused all other tenants' VM deployments to fail, since every compute node that tried to deploy the above VM hit the above error, causing each compute node to be assigned a large negative weight.

Just wanted to mention our experience in case it helped someone else who was searching for the problem.
