allocated_capacity tracking is broken in manager

Bug #1408763 reported by John Griffith
Affects: Cinder
Status: Fix Released
Importance: Medium
Assigned to: Michal Dulko
Milestone: 2015.1.0

Bug Description

There are a few things that aren't quite right in the allocated_capacity tracking. One big problem is that in the case of a create failure we don't increment allocated_capacity in the manager, yet when we later delete that same volume (which is in an error state) we still decrement allocated_capacity. The same behavior is most likely present in quota management as well.

Ideally we should probably cut the manager's tracking of this info altogether and push it back down to the drivers. In particular, for the create-failure case the driver has to be consulted on whether or not the space was actually allocated (i.e. how things failed). This introduces some other challenges, however (not least the fact that many drivers can't or don't accurately report usage info). It might be worth considering a new driver call to deal with capacity management; if the driver doesn't implement it, or returns something bogus like "infinite" or "unknown", we fall back to using ref counts from the db.
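
Roughly, the fallback idea could look like the sketch below (get_capacity_info() is a hypothetical name for the new driver call, not an existing Cinder API; the db lookup just sums the sizes of the volumes the database places on this host):

    # Sketch only: get_capacity_info() is hypothetical; the db query sums the
    # sizes of the volumes the database attributes to this backend.
    def _provisioned_capacity(self, context):
        try:
            info = self.driver.get_capacity_info()
        except NotImplementedError:
            info = None
        if not info or info.get('provisioned_gb') in (None, 'unknown', 'infinite'):
            # Fall back to ref counts from the db.
            volumes = self.db.volume_get_all_by_host(context, self.host)
            return sum(vol['size'] for vol in volumes)
        return info['provisioned_gb']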

Changed in cinder:
status: New → Triaged
importance: Undecided → Medium
assignee: nobody → John Griffith (john-griffith)
Revision history for this message
John Griffith (john-griffith) wrote :

After spending a good deal of time looking at ways to fix this, I've come to the conclusion that our current strategy is flawed and flat-out broken. There are a number of issues involved here: taskflow retries will skew the counter and also cause things to become out of sync on delete.

After a good deal of thought I see only one "right" way to fix this, and that is to push the reporting back to the driver. The challenge here is that everybody disagrees on how "numbers" should be reported. So I would like to just set one standard for the purpose of scheduler capacity filtering and move on.

I'd propose the addition of a "get_capacity_info" method to each driver. Each driver would then be responsible for returning three items as defined by Cinder (not by the backends). Those three items would be:

    provisioned_gb: <number of GB provisioned on the backend>
    allocated_gb: <number of GB actually allocated on the backend>
    available_gb: <number of GB the backend will still allow to be provisioned>
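
A rough sketch of what such a driver method might look like (the method name and return format are only what's proposed here, not an existing Cinder interface, and the backend query is hypothetical):

    from cinder.volume import driver

    class ExampleDriver(driver.VolumeDriver):
        def get_capacity_info(self):
            # Hypothetical backend query; each driver translates its own
            # reporting into the three Cinder-defined values above.
            backend = self._query_backend()
            return {
                'provisioned_gb': backend['provisioned'],
                'allocated_gb': backend['allocated'],
                'available_gb': backend['available'],
            }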

Frankly the only ones I care about are provisioned and available, but if and when we get things like over-subscription options we'll need allocated as well. The reason I chose these definitions is mostly that they're pretty standard in the storage industry, but more importantly they most closely match the internal calculations we're doing in the scheduler, which I would like to remove/replace.

I realize I'll get the argument from those who say they have "infinite capacity" because they just add disk shelves as they go; that's fine... but in most of the cases where that's been argued they don't actually have a driver in Cinder, and in their internal deployment they don't have more than one backend. So there's no reason they can't just report "actual" numbers at the time and grow (or lie; honestly I don't care). Drivers that ARE in Cinder's source tree, however, I would expect to report real and accurate numbers.

This is slightly different from the update_stats call. The reason is that I would like the manager to make this call to the driver after every driver.create_volume, to keep its view of the world in sync and NOT rely on the periodic update like we do today. By keeping the calls to a minimum we hopefully avoid any crushing overload for devices that aren't very good at handling API requests.

The base driver for now should raise NotImplementedError, in which case we fall back to the existing manual allocation increment/decrement routine in the manager.
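
On the manager side, that fallback could look something like this sketch (again assuming the hypothetical get_capacity_info() above):

    # Sketch only. The manager would call this right after each create or
    # delete; size_delta is +size on create and -size on delete.
    def _refresh_allocated_capacity(self, size_delta):
        try:
            info = self.driver.get_capacity_info()  # hypothetical call
            self.stats['allocated_capacity_gb'] = info['allocated_gb']
        except NotImplementedError:
            # Base driver: keep the existing manual increment/decrement.
            self.stats['allocated_capacity_gb'] += size_delta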

Revision history for this message
Huang Zhiteng (zhiteng-huang) wrote :

John, thanks for looking into this bug. I'd like to explain a bit about the origin of 'allocated_capacity_gb'. This parameter was introduced to the scheduler (in commit 254e37ab3c995f6514084d38f2f797da9cf5e5a9) so that the filter scheduler could simulate the behavior of the Simple scheduler (because we wanted to remove the Simple/Chance schedulers). So allocated_capacity_gb was intentionally implemented to represent the apparent Cinder-level value (simply the sum of the sizes of all volumes hosted on the backend, even including volumes in 'error' state, because that is how the Simple scheduler behaved), and because of that the tracking and calculation of this parameter was put in the volume manager instead of the driver.
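
In other words, the intended semantics are roughly the following (an illustrative sketch, not the actual manager code):

    from cinder import db

    def apparent_allocated_capacity_gb(context, host):
        # The apparent Cinder-level value: the sum of the sizes of every
        # volume the DB places on this backend, regardless of status
        # (including 'error'), matching the old Simple scheduler behavior.
        volumes = db.volume_get_all_by_host(context, host)
        return sum(vol['size'] for vol in volumes)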

Before we dive into the discussion of fixing the bug, I'd like to understand: are we going to change the semantics of 'allocated_capacity_gb' to something other than what they are now?

Revision history for this message
John Griffith (john-griffith) wrote :

Zhiteng,
Thanks for the extra info, that makes sense; and yes, it's necessary, particularly when we're doing multiple creates between periodic updates, right? What I'm proposing is essentially the same value, but reported by the driver and updated on create and delete. So my proposal is that the value remains (same label, still in the manager's self.stats) but that it's updated via an independent call to the driver after each create as well as after each delete.

I'd like the variable to remain the same and, for the most part, the process for when it's adjusted to remain the same. I just want to fix the fact that incrementing and decrementing inside the manager with the taskflow code isn't accurate (i.e. if a create fails it's never incremented, but it is decremented on delete).

I'd also like it to take into consideration "all" storage in use on the backend, not just Cinder's storage, if that makes sense?

If this is going to cause some issues that I'm not seeing let's chat on IRC and come up with something different.

Revision history for this message
Xing Yang (xing-yang) wrote :

John,

I added "provisioned_capacity_gb" in the Over Subscription patch: https://review.openstack.org/#/c/142171/. Drivers now need to report "provisioned_capacity_gb" in addition to "free_capacity_gb" and "total_capacity_gb". Winston has been reviewing it. Can you also take a look at the patch?

Is "allocated_capacity_gb" in your definition the used (written) capacity? I just want to make sure we are referring to the same thing. If you look at my Cinder spec for Over Subscription, there's a big Terminologies section where I specify the existing and new terms regarding capacities. :)
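
For reference, a driver's stats report under that patch carries the new key alongside the existing ones; a simplified sketch with placeholder values:

    # Simplified sketch of a driver's stats update; the numbers are
    # placeholders, only the keys matter here.
    def _update_volume_stats(self):
        self._stats = {
            'volume_backend_name': 'example_backend',
            'total_capacity_gb': 1000,
            'free_capacity_gb': 600,
            'provisioned_capacity_gb': 500,  # new key from the over-subscription patch
        }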

Revision history for this message
Duncan Thomas (duncan-thomas) wrote :

provisioned_gb: <number of GB provisioned on the backend (lower bound. The backend has at least this much provisioned space)>
available_gb: <number of GB the backend will still allow to be provisioned (lower bound. The backend can guarantee at least this much space is definitely available)>

allocated_gb: <number of GB actually allocated on the backend> Not sure how to bound this, nor even what 'actually allocated' means. If I have a thin-provisioned 10 gig volume with 5 gig of actual data in it that is compressed to 3 gig, but I've also got 3 gig of it in the transparent caching tier, how much is allocated? 3 gig? 5 gig? 10 gig?

Mike Perez (thingee)
Changed in cinder:
milestone: none → kilo-2
Revision history for this message
John Griffith (john-griffith) wrote :

I've looked at this quite a bit, as well as a number of options; the preferred fix was stated as using taskflow, but honestly I can't figure out a good way to actually do that. Letting this one go to somebody who understands how to work in taskflow better than I do.

Changed in cinder:
assignee: John Griffith (john-griffith) → nobody
Mike Perez (thingee)
Changed in cinder:
milestone: kilo-2 → kilo-3
Revision history for this message
Michal Dulko (michal-dulko-f) wrote :

Here's how Nova is solving this problem:

1. It's assumed that the scheduler may have invalid data.
2. Libvirt is responsible for tracking resources on each compute node.
3. If a compute node gets a request that exceeds its quota, it triggers a reschedule and updates the scheduler DB.
4. Apart from that, there's a periodic task updating this info from all compute nodes.
5. In the end you achieve eventual consistency.

So this is probably how the solution of moving tracking back to the drivers would look.

I wonder why we can't just move incrementing allocated_capacity to before the call to the driver? That would fix the issue jgriffith explained in the bug description. It would also be helpful if John could explain the ideas he tried that didn't work, so I can get in sync with his work.
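
As a sketch, that simpler idea would look roughly like this (illustrative only, and not what was eventually merged):

    # Sketch of the simpler idea only: count the volume before the create is
    # attempted, so a create that fails into 'error' state is still reflected
    # in allocated_capacity_gb and the decrement on the eventual delete stays
    # balanced.
    def create_volume(self, context, volume):
        self.stats['allocated_capacity_gb'] += volume['size']
        flow_engine = self._build_create_flow(context, volume)  # hypothetical helper
        flow_engine.run()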

Changed in cinder:
assignee: nobody → Michal Dulko (michal-dulko-f)
status: Triaged → In Progress
Revision history for this message
Michal Dulko (michal-dulko-f) wrote :

Okay, I've read jgriffith's explanations from the meeting and here's my idea of how to fix it:

1. In _pre_reschedule of OnFailureRescheduleTask, also reset the volume's host.
2. Add a try-except block to the manager's create_volume and, in the except clause, check whether the volume was actually rescheduled (this can be done using flow_engine.storage); increment allocated_capacity only if it wasn't.
3. If no exception occurred - proceed normally.

This should take care of it. The problem isn't in internal TaskFlow retries (Cinder flows aren't configured to retry) but in the rescheduling mechanism inherited from Nova. In my opinion it was broken before the move to TaskFlow.
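
Roughly, that fix would shape up in the manager like the sketch below (names are illustrative; per the commit that eventually merged, the increment happens only when no reschedule was triggered):

    # Illustrative sketch; helper and attribute names are made up. The
    # reschedule path marks the exception, and the finally block increments
    # allocated_capacity_gb only when no reschedule was triggered, since a
    # failure without rescheduling leaves an 'error' volume on this host
    # that a later delete will decrement.
    def create_volume(self, context, volume):
        rescheduled = False
        flow_engine = self._build_create_flow(context, volume)  # hypothetical helper
        try:
            flow_engine.run()
        except Exception as exc:
            rescheduled = getattr(exc, 'rescheduled', False)
            raise
        finally:
            if not rescheduled:
                self.stats['allocated_capacity_gb'] += volume['size']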

Revision history for this message
John Griffith (john-griffith) wrote :

@Michal,
Certainly wasn't broken before; not to mention the tracking of that value didn't exist before taskflow anyway. Prior to taskflow the manager had its own retry methods, so it could count retries and get responses back in the caller.

Regardless, it sounds like you've got some ideas on how to fix this up, so that's great. You'll need to change some things in the runner, though, because the exception never gets back to the manager. Anyway, looking forward to seeing what you come up with.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to cinder (master)

Fix proposed to branch: master
Review: https://review.openstack.org/154920

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to cinder (master)

Reviewed: https://review.openstack.org/154920
Committed: https://git.openstack.org/cgit/openstack/cinder/commit/?id=ad02f8208b76b27513ca8a8c21b255d98080d8b3
Submitter: Jenkins
Branch: master

commit ad02f8208b76b27513ca8a8c21b255d98080d8b3
Author: Michal Dulko <email address hidden>
Date: Thu Feb 12 09:39:08 2015 +0100

    Fix allocated_capacity tracking when rescheduling

    This aims to fix capacity tracking in the manager in case of
    rescheduling. Idea is to inject information if volume was rescheduled
    into the exception passed up so manager can decide if incrementing
    allocated_capacity is needed. Then in the manager finally block is used
    to make sure that allocated_capacity is incremented in case of failure
    that haven't triggered rescheduling. Also adding two unit tests checking
    if tracking is correct.

    Closes-Bug: 1408763

    Change-Id: Icd6b04ac60c17cbda225d3be8bcc4435ea0cd7a8

Changed in cinder:
status: In Progress → Fix Committed
Thierry Carrez (ttx)
Changed in cinder:
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in cinder:
milestone: kilo-3 → 2015.1.0