CapacityFilter intermittently fails in the gate due to insufficient virtual space on the host

Bug #1741275 reported by Matt Riedemann
Affects: Cinder
Status: Fix Released
Importance: High
Assigned to: Matt Riedemann
Milestone: (none)

Bug Description

Seen here:

http://logs.openstack.org/69/530969/1/check/tempest-full/223c175/controller/logs/screen-c-sch.txt

VolumesBackupsTest-1618334087 None] Checking if host ubuntu-xenial-inap-mtl01-0001669860@lvmdriver-1#lvmdriver-1 can create a 1 GB volume (907e0662-4844-4b30-ae69-07c01482d36a) {{(pid=13148) backend_passes /opt/stack/cinder/cinder/scheduler/filters/capacity_filter.py:62}}
Jan 03 17:51:40.739901 ubuntu-xenial-inap-mtl01-0001669860 cinder-scheduler[13148]: WARNING cinder.scheduler.filters.capacity_filter [None req-b28e1f6d-4c44-40ee-b529-4c05b4f64fb5 tempest-VolumesBackupsTest-1618334087 None] Insufficient free virtual space (-10.6GB) to accommodate thin provisioned 1GB volume on host ubuntu-xenial-inap-mtl01-0001669860@lvmdriver-1#lvmdriver-1.
Jan 03 17:51:40.740001 ubuntu-xenial-inap-mtl01-0001669860 cinder-scheduler[13148]: DEBUG cinder.scheduler.base_filter [None req-b28e1f6d-4c44-40ee-b529-4c05b4f64fb5 tempest-VolumesBackupsTest-1618334087 None] Filter CapacityFilter returned 0 host(s) {{(pid=13148) get_filtered_objects /opt/stack/cinder/cinder/scheduler/base_filter.py:125}}

http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22Insufficient%20free%20virtual%20space%5C%22%20AND%20message%3A%5C%22to%20accommodate%20thin%20provisioned%201GB%20volume%20on%5C%22%20AND%20tags%3A%5C%22screen-c-sch.txt%5C%22&from=7d

11 hits in 7 days, in both the check and gate queues, all failures; it looks like this may have started around Jan 1.

Tags: scheduler
Matt Riedemann (mriedem)
Changed in cinder:
status: New → Confirmed
Revision history for this message
Ilya Shakhat (shakhat) wrote :

More recent occurrence of the issue: http://logs.openstack.org/64/519664/3/check/tempest-full/dbb3f0e/controller/logs/screen-c-sch.txt.gz#_Jan_23_18_55_00_424738

Analysis of the issue (according to logs from Cinder scheduler):

Jan 23 18:54:11.059512 - scheduler receives status update with free_capacity_gb = 9.47
Jan 23 18:54:11.169724 - filter reports that free_capacity_gb: 9.47
Jan 23 18:54:12.503192 - filter reports that free_capacity_gb: 8.47
....
Jan 23 18:54:57.731754 - filter reports that free_capacity_gb: 0.47
Jan 23 18:55:00.424738 - Insufficient free virtual space (-10.6GB)

The value -10.6 is calculated; one way to arrive at it is when backend_state.free_capacity_gb = -0.53 (https://github.com/openstack/cinder/blob/master/cinder/scheduler/filters/capacity_filter.py#L74-L142)
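For reference, a simplified sketch of that thin-provisioning check, paraphrased from the linked capacity_filter.py (not a verbatim copy; the total/provisioned values in the example call are illustrative placeholders):

# Simplified sketch of the thin-provisioning branch of CapacityFilter,
# paraphrased from the linked capacity_filter.py (not verbatim).
def backend_passes_thin(free_capacity_gb, total_capacity_gb,
                        provisioned_capacity_gb,
                        max_over_subscription_ratio, requested_size_gb):
    # Reject outright if adding this volume would exceed the projected
    # over-subscription ratio.
    provisioned_ratio = ((provisioned_capacity_gb + requested_size_gb) /
                         total_capacity_gb)
    if provisioned_ratio > max_over_subscription_ratio:
        return False
    # Otherwise the backend passes only if the "adjusted" free virtual
    # capacity (physical free space times the over-subscription ratio)
    # can hold the requested volume.
    adjusted_free_virtual = free_capacity_gb * max_over_subscription_ratio
    return adjusted_free_virtual >= requested_size_gb

# With free_capacity_gb = -0.53 and the gate's ratio of 20.0:
#   adjusted_free_virtual = -0.53 * 20.0 = -10.6  -> the "-10.6GB" warning.
# (The total/provisioned arguments below are illustrative placeholders.)
print(backend_passes_thin(-0.53, 22.8, 24.0, 20.0, 1))  # False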

The next status update arrives at Jan 23 18:55:05.453072 and resets the free space value back to 9.47.

To sum up: between two capacity updates the scheduler places a number of volumes; each placement reduces the local value of free disk space until it drops below 0.
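
A simplified sketch of that local accounting, modeled on the consume_from_volume logic in the scheduler's host state tracking (cinder/scheduler/host_manager.py); trimmed down, not a verbatim copy:

# Sketch of the scheduler's in-memory accounting between two driver
# capacity reports (modeled on consume_from_volume; trimmed, not verbatim).
class BackendStateSketch(object):
    def __init__(self, free_capacity_gb):
        self.free_capacity_gb = free_capacity_gb
        self.allocated_capacity_gb = 0
        self.provisioned_capacity_gb = 0

    def consume_from_volume(self, size_gb):
        # The full requested size is deducted locally, even for thin
        # volumes, so a burst of requests can drive the value negative.
        self.allocated_capacity_gb += size_gb
        self.provisioned_capacity_gb += size_gb
        self.free_capacity_gb -= size_gb

state = BackendStateSketch(free_capacity_gb=9.47)
for _ in range(10):   # ten 1 GB requests before the next stats update
    state.consume_from_volume(1)
print(round(state.free_capacity_gb, 2))   # -0.53, matching the timeline above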

Revision history for this message
Matt Riedemann (mriedem) wrote :

This started spiking again since March 13:

https://bugs.launchpad.net/cinder/+bug/1756220

Revision history for this message
Matt Riedemann (mriedem) wrote :

For the recent spike, my guess is this is the thing that tipped it over:

https://github.com/openstack/cinder-tempest-plugin/commit/c82162a385040fcf70444a10bae98ca3f2823bc2

There are no other related looking changes in devstack or cinder since around March 12th.

Revision history for this message
Matt Riedemann (mriedem) wrote :

mtreinish pointed out that that test from the tempest plugin should only get loaded in certain jobs, and he's right; I was looking at this failed job:

http://logs.openstack.org/02/539702/3/check/legacy-tempest-dsvm-cells/17731a9/job-output.txt.gz

And that test isn't in there, so it must be something else.

Revision history for this message
Matt Riedemann (mriedem) wrote :

This tempest change merged on March 13th to rename some volume tests:

https://github.com/openstack/tempest/commit/87a069d2a0812a35248eb436583b1c6118631426

mtreinish thought that might change the test order and thus make some other volume tests run concurrently where they didn't before, which could cause scheduling issues with the CapacityFilter.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to cinder (master)

Fix proposed to branch: master
Review: https://review.openstack.org/553794

Changed in cinder:
assignee: nobody → Matt Riedemann (mriedem)
status: Confirmed → In Progress
Matt Riedemann (mriedem)
Changed in cinder:
importance: Undecided → High
Matt Riedemann (mriedem)
tags: added: scheduler
Revision history for this message
Gorka Eguileor (gorka) wrote :

Looking at the actual error in the logs makes me think that we need to increase the max_over_subscription_ratio configured in the gate, create a bigger backend, or reduce the speed at which we create volumes.

That, or we stop deducting the full volume size from free_capacity_gb in the Cinder scheduler for thin-provisioned volumes.
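
Purely as an illustration of that second idea (this is not existing Cinder code, just a hypothetical sketch): for thin-provisioned volumes the local accounting could grow only the allocated/provisioned counters and leave free_capacity_gb to the periodic driver report.

# Hypothetical sketch only, NOT existing Cinder code: skip the physical
# free-space deduction for thin volumes and let the next driver stats
# report correct free_capacity_gb instead.
def consume_from_volume(state, size_gb, is_thin_provisioned):
    state.allocated_capacity_gb += size_gb
    state.provisioned_capacity_gb += size_gb
    if not is_thin_provisioned:
        # Thick volumes consume physical space up front, so keep deducting.
        state.free_capacity_gb -= size_gb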

On a side note, there seem to be some other issues in the Cinder scheduler:

- We can get to negative numbers for free_capacity_gb

- We are logging some incoherent things:

Looking at req-8cd996a1-3662-44ab-9ca2-593aa84833b3 in one of the failures [1], we would assume from the information logged at Mar 16 13:26:41.201266 that the backend has more than enough space for the volume:

{'provisioned_capacity_gb': 1.0,
 'volume_backend_name': 'lvmdriver-1',
 'location_info': 'LVMVolumeDriver:ubuntu-xenial-rax-ord-0003018294:stack-volumes-lvmdriver-1:thin:0',
 'free_capacity_gb': 22.76,
 'allocated_capacity_gb': 7,
 'total_volumes': 2,
 'pool_name': 'lvmdriver-1',
 'reserved_percentage': 0,
 'max_over_subscription_ratio': '20.0',
 'total_capacity_gb': 22.8}

But that's not the actual data used for filtering, as can be seen in the log entry for that same request at Mar 16 13:26:41.202743:

Backend: host 'ubuntu-xenial-rax-ord-0003018294@lvmdriver-1#lvmdriver-1'
free_capacity_gb: -0.23999999999999844,
total_capacity_gb: 22.8
allocated_capacity_gb: 30
max_over_subscription_ratio: 20.0
reserved_percentage: 0
provisioned_capacity_gb: 24.0

[1] http://logs.openstack.org/67/550967/4/check/tempest-full-py3/c67e31b/controller/logs/screen-c-sch.txt
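
For reference, plugging the logged numbers into the same adjusted-free-virtual check sketched above (the -4.8 figure below is computed here from those values, not quoted from the log):

# Values taken from the two log entries quoted above.
reported_free = 22.76                   # from the driver's stats update
filter_free   = -0.23999999999999844    # local backend state after deductions
ratio         = 20.0
print(filter_free * ratio)              # about -4.8 GB of free virtual space
# The reported stats look healthy, but the filter evaluates the locally
# adjusted state, sees a negative adjusted free virtual capacity, and
# rejects the backend.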

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to cinder (master)

Reviewed: https://review.openstack.org/553794
Committed: https://git.openstack.org/cgit/openstack/cinder/commit/?id=dfce8aef57ba03105459c0caff99c112ba373ad1
Submitter: Zuul
Branch: master

commit dfce8aef57ba03105459c0caff99c112ba373ad1
Author: Matt Riedemann <email address hidden>
Date: Fri Mar 16 13:48:39 2018 +0000

    Revert "Stop unnecessarily querying storage for stats"

    This reverts commit 458840ddb12164c27054413bb7a7ffcd168b614e.

    We've seen a spike in scheduling failures since this merged,
    presumably because the scheduler doesn't have the latest
    information and thinks the storage backend is at capacity.

    Change-Id: I5b6645cfd1277f315eb9cf186848d0eea8bc0b98
    Closes-Bug: #1741275

Changed in cinder:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/cinder 13.0.0.0b1

This issue was fixed in the openstack/cinder 13.0.0.0b1 development milestone.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to cinder (master)

Fix proposed to branch: master
Review: https://review.openstack.org/651172

Revision history for this message
Sahid Orentino (sahid-ferdjaoui) wrote :

I'm wondering whether this was fixed by updating the quota or by the revert that was proposed. Basically, I have the same issue in one of my environments, and I proposed a fix here [0]. I think it's related to how we compute the virtual free space and compare it against the requested space. Any comments would be appreciated.

[0] https://review.openstack.org/#/c/651172/
