Sporadic incoherent metrics when driver.get_host_cpu_stats takes longer than 1 second to execute

Bug #1490837 reported by Joe Cropper
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
Fix Released
Medium
Joe Cropper

Bug Description

When using the libvirt CPU monitor (i.e., virt_driver) for metrics collection, I sporadically noticed cases where the values for cpu.user.percent + cpu.kernel.percent + cpu.idle.percent didn't equal 100, which should be the case. This wasn't happening very often so it was quite difficult to track down, but after adding several debug logs, over time, I was able to track down the problem.

If you look at this code:
https://github.com/openstack/nova/blob/master/nova/compute/monitors/cpu/virt_driver.py#L52

... you'll notice that there is an inherent assumption that for a given "round" of metrics gathering, there is a built-in assumption that the collective time to call metric_driver.get_metric(metric_name) (keep in mind there are 10 metrics right now) won't exceed 1 second (if it does, it's considered the "next" round of metric collection) ... i.e., so for the first metric collection, we refresh the host CPU stats and the subsequent n-1 calls simply use the cache ... this yielding a coherent answer (i.e., the percentages would all sum up to 100% as you'd expect).

However, in some cases (e.g., if the system is undergoing stress, etc.), I've seen cases where this code:
https://github.com/openstack/nova/blob/master/nova/compute/monitors/cpu/virt_driver.py#L60

... takes more than 1 second to execute, which then causes [within the "same" metrics round] the data to be refreshed, thus yielding potentially incoherent results (e.g., summation of percentages < 100 or > 100 -- makes for some interesting data points). :-)

The fix is simple... let's just move the timestamp cache *after* the host stats have been collected... problem solved.

P.S. This problem is occurring on Liberty (and I suspect it would happen on older releases too).

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/219153

Changed in nova:
assignee: nobody → Joe Cropper (jwcroppe)
status: New → In Progress
Changed in nova:
assignee: Joe Cropper (jwcroppe) → Jay Pipes (jaypipes)
Jay Pipes (jaypipes)
Changed in nova:
assignee: Jay Pipes (jaypipes) → Joe Cropper (jwcroppe)
importance: Undecided → Medium
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/219153
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=7fe05a8ca45b36c173458b92aa6ad6d9df861736
Submitter: Jenkins
Branch: master

commit 7fe05a8ca45b36c173458b92aa6ad6d9df861736
Author: Joe Cropper <email address hidden>
Date: Tue Sep 1 01:18:15 2015 -0500

    Ensure Nova metrics derived from a set of metrics

    This patch addresses a problem in which the current Nova metrics can
    yield a set of incoherent results. For example, prior to this patch,
    it is possible that a collection of metrics sent in a single RPC
    notification message can contain values such as the following:

    {'cpu.user.percent': A, 'cpu.kernel.percent': B,
     'cpu.idle.percent': C, 'cpu.iowait.percent': D} such that you can
    end up with A + B + C + D != 100, which does not make any sense
    from an end consumer perspective.

    The root cause of this problem stems from the fact that each
    individual metric retrieval (i.e., via get_metric(name)) could
    result in the driver.get_host_cpu_stats() function being called,
    which leads to the incoherent results mentioned above.

    To solve this, we only refresh the metric data once per collection
    cycle. Additionally, the `Monitor` base class uses get_metrics()
    instead of get_metric() and returns a 3-tuple of the form
    (metric_name, metric_value, timestamp).

    This patch also removes some code from the CPU virt_driver's
    _update_data() method that checked whether the last call was
    less than 1 second ago, which seemed to try to solve this
    coherency issue, but didn't really work all of the time (and hence
    how this defect was encountered to begin with). Given we use the
    [more appropriate] metric set approach now, this code no longer
    serves any purpose.

    Unit tests updated to work with the new framework.

    Change-Id: Ia23783f2e6b92b3af2e9acf09f95a9eeacd8eb3b
    Closes-Bug: 1490837
    Co-authored-by: Jay Pipes <email address hidden>

Changed in nova:
status: In Progress → Fix Committed
Revision history for this message
Thierry Carrez (ttx) wrote : Fix included in openstack/nova 13.0.0.0b1

This issue was fixed in the openstack/nova 13.0.0.0b1 development milestone.

Changed in nova:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.