Sporadic incoherent metrics when driver.get_host_cpu_stats takes longer than 1 second to execute
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
OpenStack Compute (nova) | Fix Released | Medium | Joe Cropper |
Bug Description
When using the libvirt CPU monitor (i.e., virt_driver) for metrics collection, I sporadically noticed that cpu.user.percent + cpu.kernel.percent + cpu.idle.percent did not sum to 100, as it always should. This happened rarely, so it was quite difficult to track down, but after adding several debug logs I was eventually able to find the cause.
If you look at this code:
https:/
... you'll notice an inherent assumption that, for a given "round" of metrics gathering, the collective time to call metric_
However, when the system is under stress, for example, I've seen cases where this code:
https:/
... takes more than 1 second to execute, which then causes [within the "same" metrics round] the data to be refreshed, thus yielding potentially incoherent results (e.g., summation of percentages < 100 or > 100 -- makes for some interesting data points). :-)
The fix is simple... let's just move the timestamp cache *after* the host stats have been collected... problem solved.
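To make the timing issue concrete, here is a minimal, self-contained sketch of the caching pattern described above. It is not the actual nova code: `FakeClock`, `CpuMonitor`, `make_slow_collector`, `run_round`, and the sample values are all hypothetical names and numbers invented for illustration, and the collector is a stand-in for driver.get_host_cpu_stats that is assumed to take 1.5 seconds per call. The only difference between the two runs is whether the cached timestamp is taken before or after the stats are collected.

```python
METRIC_NAMES = ("cpu.user.percent", "cpu.kernel.percent", "cpu.idle.percent")


class FakeClock:
    """Manually advanced clock so the example is deterministic."""

    def __init__(self):
        self.now = 0.0

    def advance(self, seconds):
        self.now += seconds


def make_slow_collector(clock, samples, duration):
    """Hypothetical stand-in for driver.get_host_cpu_stats taking `duration` seconds."""
    remaining = iter(samples)

    def collect():
        clock.advance(duration)  # simulate a slow call on a stressed host
        user, kernel, idle = next(remaining)
        return dict(zip(METRIC_NAMES, (user, kernel, idle)))

    return collect


class CpuMonitor:
    """Simplified monitor that refreshes its cache at most once per second."""

    def __init__(self, clock, collect, stamp_after_collect):
        self.clock = clock
        self.collect = collect
        self.stamp_after_collect = stamp_after_collect
        self.data = {}
        self.refreshes = 0

    def _update_data(self):
        timestamp = self.data.get("timestamp")
        if timestamp is not None and self.clock.now - timestamp <= 1:
            return  # cache is still considered fresh
        started = self.clock.now
        stats = self.collect()  # the clock moves forward during this call
        self.data = dict(stats)
        # Buggy variant stamps the time from before the slow call;
        # fixed variant stamps the time once the stats are in hand.
        self.data["timestamp"] = (
            self.clock.now if self.stamp_after_collect else started
        )
        self.refreshes += 1

    def get_metric(self, name):
        self._update_data()
        return self.data[name]


def run_round(stamp_after_collect):
    """Read all three metrics back to back and return (sum, refresh count)."""
    clock = FakeClock()
    # Each sample sums to 100 on its own.
    samples = [(10, 5, 85), (40, 10, 50), (70, 20, 10)]
    collect = make_slow_collector(clock, samples, duration=1.5)
    monitor = CpuMonitor(clock, collect, stamp_after_collect)
    total = sum(monitor.get_metric(name) for name in METRIC_NAMES)
    return total, monitor.refreshes


if __name__ == "__main__":
    print(run_round(stamp_after_collect=False))  # (30, 3)
    print(run_round(stamp_after_collect=True))   # (100, 1)
```

With the timestamp taken before the slow call, the cache already looks stale by the time the next metric is read, so each of the three metrics comes from a different sample and the sum is off. With the timestamp taken afterwards, the whole round is served from a single sample.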
P.S. This problem is occurring on Liberty (and I suspect it would happen on older releases too).
Changed in nova:
assignee: Joe Cropper (jwcroppe) → Jay Pipes (jaypipes)
Changed in nova:
assignee: Jay Pipes (jaypipes) → Joe Cropper (jwcroppe)
importance: Undecided → Medium
Changed in nova:
status: Fix Committed → Fix Released
Fix proposed to branch: master
Review: https://review.openstack.org/219153