Error rate incorrectly spikes with any influx of machines

Bug #1069827 reported by Matthew Paul Thomas
This bug affects 3 people
Affects: Errors
Status: Confirmed
Importance: High
Assigned to: Unassigned

Bug Description

The average daily error rate for an Ubuntu version, e.g. Ubuntu 12.10, is calculated as
    number of error reports received from Ubuntu 12.10 machines that day
divided by
    number of Ubuntu 12.10 machines that reported any errors in the past 90 days.

That denominator is our best estimate of the number of machines that *would* have reported errors if they'd experienced any that day. (We don't have any way of counting machines that never report errors.) But this estimate falls apart when the number of machines spikes or plummets, as it did with the release of Ubuntu 12.10 on October 18th. <https://errors.ubuntu.com/>
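As a sketch (hypothetical names and data structures, not the error tracker's actual code), the current calculation is roughly:

    # Sketch only: reports is a list of (machine_id, date) pairs for one Ubuntu
    # version, one entry per error report received.
    from datetime import timedelta

    def daily_error_rate(reports, day):
        window_start = day - timedelta(days=90)
        # Numerator: error reports received that day.
        reports_today = sum(1 for machine, d in reports if d == day)
        # Denominator: machines that reported at least one error in the past 90 days.
        known_machines = {machine for machine, d in reports if window_start <= d <= day}
        return reports_today / len(known_machines) if known_machines else 0.0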

For example, imagine that Ubuntu 12.10 was installed on zero machines until 12:00 a.m. on October 18th, and that at that moment it instantly became installed on millions of machines, with no further installations in the days afterward.

At the end of October 18th, every Ubuntu 12.10 machine the error tracker knew about would be a machine that had reported at least one error that day. We wouldn't be including all those machines that would have reported errors but didn't encounter any. So where x = the real error rate, our calculation would instead return an error rate of about x + 1.

Continuing this pattern, on October 19th the result of our calculation would drop to about x + 1/2, on October 20th it would drop to about x + 1/3, on October 21st it would drop to about x + 1/4, and so on.

(I'm probably wrong with that part, because I'm ignoring a bunch of things -- for example, that the machines that experienced errors previously are more likely to experience errors in future.)
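A rough simulation of this thought experiment (purely illustrative, with an invented machine count and per-machine error probability) shows the same shape: the computed rate starts near 1 on the first day and decays toward the assumed real rate as the pool of machines that have ever reported fills up.

    # Purely illustrative; the numbers are invented, not real tracker data.
    import random

    MACHINES = 100_000       # all hypothetically installed at once on day 0
    REAL_RATE = 0.12         # assumed real errors per machine per day
    seen = set()             # machines that have reported at least one error so far

    for day in range(10):
        reporters = [m for m in range(MACHINES) if random.random() < REAL_RATE]
        seen.update(reporters)
        # The denominator is only the machines we have ever heard from, so early
        # days are dominated by machines that reported that very day.
        print(day, len(reporters) / len(seen))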

Reality isn't so stark, for three reasons:
(1) a small percentage of machines now running Ubuntu 12.10 were running pre-release versions
(2) Ubuntu 12.10 wasn't released until about 6pm on the 18th
(3) even those waiting for the release didn't all install on release day.

Nevertheless, our error rate calculation shows this problem quite well:
* September 18th to October 17th, it was between 0.10 and 0.15
* October 18th (12.10 being released about 6pm), it was 0.28
* October 19th, 0.66
* October 20th, 0.56
* October 21st, 0.45

We can expect the calculation to continue declining until it reaches the "real" daily error rate -- probably between 0.10 and 0.15, like it was before.

This theory is supported by looking at the error reports themselves. There is no spike in relative frequency of any individual problem that pre-release testers weren't seeing (for example, a problem with the Ubuntu installer). There are just more reports of every problem.

This would also explain the smaller spike seen at beta 1, from an error rate of 0.08 on release day September 6th to 0.15 on September 7th.

This bug will be fixed once our calculation no longer produces an error rate that spikes after release days. A mathematician could probably give us a good way to do this.

Some possible approaches, most hackish first:

(a) Take a machine into account only once it has reported at least two errors.

(b) Take a machine into account only once it has been known for at least 1/x′ days, where x′ = the error rate for the previous day.

(c) Weight a machine's individual error count by the number of days since its first error report -- ignore it altogether on the first day, multiply it by 1/2 the next day, 2/3 the day after, 3/4 the day after that, etc.
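As a rough sketch of (c), assuming we track each machine's first-report date (hypothetical code, not a worked-out design, with the denominator left as the existing 90-day machine count):

    # Hypothetical sketch of approach (c).
    def weighted_error_rate(reports_today, first_seen, machines_90d, today):
        # reports_today: {machine_id: number of error reports today}
        # first_seen:    {machine_id: date of that machine's first error report}
        # machines_90d:  machines that reported any error in the past 90 days
        weighted = 0.0
        for machine, count in reports_today.items():
            days_known = (today - first_seen[machine]).days
            # 0 on the day of the first report, then 1/2, 2/3, 3/4, ...
            weighted += count * days_known / (days_known + 1)
        return weighted / len(machines_90d) if machines_90d else 0.0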

Revision history for this message
Evan (ev) wrote :

[17:05:08] <slangasek> but what you could do is tack an install-date into the submission data
[17:05:34] <slangasek> you could then retroactively include that machine in the stats for every day since it was installed
[17:05:49] <slangasek> (the math would be fierce though, left as an exercise yadda)
[17:07:56] <slangasek> again that's a retroactive calculation, so it doesn't keep the spike from appearing at release time but it will make it disappear after the fact when recalculated
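A rough sketch of that retroactive recalculation, assuming the submission carried an install date (hypothetical field and structures; the 90-day window is ignored here for simplicity):

    # Hypothetical sketch; install_date would come from the submission data.
    def retroactive_rate(reports_by_day, install_date, day):
        # reports_by_day: {date: [machine_ids that reported that day]}
        # install_date:   {machine_id: date the machine says it was installed}
        # Count every known machine for each day since it was installed,
        # even if its first error report arrived later.
        denominator = sum(1 for installed in install_date.values() if installed <= day)
        numerator = len(reports_by_day.get(day, []))
        return numerator / denominator if denominator else 0.0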

Revision history for this message
Jef Spaleta (jspaleta) wrote :

pardon,
Is there any way you can cycle back and give a clear problem statement as to what you are trying to generate a metric for?

Right now the descriptive phrasing being used to describe the calculation does not fit with the calculation being performed, so I'm a bit confused as to what you are trying to calculate.

What you are calculating is the ratio of daily error reports over the 90-day average, which is not adequately described as "The average daily error rate." Average daily error rate is simply the number of reports per day, averaged over some time period... no ratioing.

This sort of ratio calculation you're doing actually enhances the sensitivity to perturbations compared to long-term trends, which seems to be exactly the opposite of what you are trying to achieve based on the fixes you've proposed. It's not a useless calculation, especially for separating out slow variation in well-established periodic processes from near-term responses (for reference: quiet-day curves for riometers are an example of the utility of the ratioing technique with complicated periodic forcing data). That is why I ask if you can cycle back and make a more expansive statement about what you are trying to measure.

I feel like what you are trying to achieve requires some sort of surrogate for install base size as a denominator.
If you accept the premise that the proportion of users with popcon enabled is a slowly changing percentage of the user base, you might be able to use the "voted" or "recent" column from Ubuntu popcon stats on a daily basis as a normalization surrogate. You can't use the "installed" column as a surrogate because Ubuntu's popcon implementation isn't culling that stat correctly. But the voted and recent columns behave as expected and cull over a 30-day or so timescale, similar to Debian's popcon stats.
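A rough sketch of that normalization (the daily popcon feed and the scaling factor are assumptions, not something that exists today):

    # Sketch only: assumes a daily popcon "voted" count is available and that the
    # share of error-reporting machines per popcon voter changes only slowly.
    def popcon_normalized_rate(reports_today, popcon_voted_today, reporters_per_voter):
        estimated_reporting_machines = popcon_voted_today * reporters_per_voter
        return reports_today / estimated_reporting_machines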

I could create a dogfoodable analysis of the approach, but I'd probably need more access than I have to the popcon implementation to make a side cache of stats, and I just can't seem to put a string of 4 solid hours together right now outside of the day job... but you should look into the popcon voted or recent columns and see if they make a usable denominator that captures the periodic spikiness of the pool of reporting machines.

-jef

Revision history for this message
Matthew Paul Thomas (mpt) wrote :

"Average daily error rate is simply the number of reports per day, averaged over some time period"

No, that wouldn't take into account the number of reporting machines. If Ubuntu's install base doubled overnight and nothing else changed, the number of reports per day would double too, but the real error rate would be identical. What we need is a calculation that would reflect that as closely as possible.

Using any "surrogate for install base size" would be inappropriate for that calculation, because we shouldn't be counting machines -- Goobuntu installs, for example -- that never submit error reports. (And using popcon would be especially inappropriate, since 12.10 and later have no GUI for popcon.) Instead we should be dividing the number of reports we receive by our best estimate of the number of machines that *would* submit errors if they had any.

That's what the 90-day count is for: if a machine submitted any errors in the past 90 days, that's a good sign that it would submit any errors it had today as well. However, it has biases in both directions. If a machine is destroyed or has Ubuntu removed, we'll keep counting it for 90 days, which biases the rate downward. Conversely, we have no way of telling how long a machine was being used error-free before it submitted its first error, which biases the rate upward. I thought those biases would cancel each other out, but the latter is biting us here. If people start using Ubuntu 12.10 on a million new machines today, 800,000 of them are used by people who would submit error reports, and 1000 of those do experience an error, all we know is that we have 1000 error reports from machines we haven't seen before. We have no way of telling the other numbers. So we need a formula for discounting (not ignoring, but discounting) those 1000 error reports until we can be more confident that those machines have been running Ubuntu for a while.

Revision history for this message
Jef Spaleta (jspaleta) wrote :

"If Ubuntu's install base doubled overnight and nothing else changed, the number of reports per day would double too, but the real error rate would be identical. What we need is a calculation that would reflect that as closely as possible."

Right... which is why I'm suggesting a surrogate for install base. The calculation you want to perform, in your own words, is an error rate normalized to some time-evolving population. The denominator must be sensitive to spikiness on the same time scale as the numerator when you normalize: like the numerator, it must capture the release spikes, not perfectly, but with the same general spectral content.

Normalizing to any running average is going to make you more sensitive to the numerator's spikiness. The 90-day running average is not the correct denominator for what you want to achieve. You have to find a surrogate for install base that tracks the spikiness of the install base size. And the discounting idea doesn't solve the underlying problem. You want to normalize against something that captures the spikiness of the number of machines in the wild that can report. A weighted average of previously reporting machines will not capture that, no matter how complicated you make it.

Again I would ask you to step back and define what you are trying to achieve in a metric. Why is the spikiness in the unnormalized error rate bad for you? When do you expect the normalized error rate to actually show an increase instead of a flat line? You are bundling a lot of assumptions into that running-average methodology. If you are building a methodology to give you exactly the curve you expect, for no other reason than that you expect it, you aren't building a valid methodology.

There may be something more esoteric you can do with a matched spectral filter (averaging is just a flat spectral filter): if you can come up with an expected response over a release life cycle, you could build a filtered response based on that expectation.
If you could provide me with the error rate data from the full 11.10 cycle, I could use it to generate a spectral filter and see what happens to the 12.04 data. But even this can only be used to ask questions about how one cycle compares to another, because it's expectation-based normalization, which is not what I think you want to measure, though it's still not clear to me what you want. All I know for sure is that a running average of historical data is not going to capture spikiness; averaging smooths, that's what it does.

-jef

Revision history for this message
Matthew Paul Thomas (mpt) wrote :

Why install base is irrelevant is explained in my previous comment. To reiterate, counting machines whether they would report errors or not would make the error rate look much lower than it really is.

What we are trying to achieve in a metric is explained at <https://wiki.ubuntu.com/ErrorTracker#Rationale>.

Why spikiness is bad is explained in the bug description. To reiterate, it isn't inherently bad, but more-of-everything spikes on release days are *symptoms* of a faulty calculation. They may not be the only symptom, so just getting rid of spikes alone is not interesting. For example, it might turn out that the apparent worsening of Q in the month before release had the same cause: merely that the number of testers was increasing all the while.

The error tracker was introduced in Ubuntu 12.04, so there is no data for 11.10.

We do not use running averages of anything.

Revision history for this message
Jef Spaleta (jspaleta) wrote :

I give up. I hope another trained mathematician takes up your call for help. Good luck.

-jef

Evan (ev)
Changed in errors:
importance: Undecided → High
status: New → Confirmed
Revision history for this message
Florian W. (florian-will) wrote :

I'm not working for Canonical / the error tracker. I'm also not a trained mathematician. :-) However, I was a little puzzled as a user when I looked at the "This helps measure reliability of …" graph on e.u.c. (It doesn't even indicate what kind of data is plotted there; I actually thought higher = better, because higher reliability = better, so I wondered why reliability seems to be plummeting recently).

So after reading this bug, it seems the graph attempts to indicate the "average number of errors an Ubuntu user encounters on a single day" but fails to do so correctly. I think Jef Spaleta's point is comprehensible. Trying to deduce the number of machines with error reporting capability by measuring the number of different machines that reported errors recently will inherently fail early on, and any time a lot of new machines are added to or removed from the system, no matter what kind of magic is applied to hide that problem. I can't prove that, though. :-) The offer to "generate a spectral filter" (whatever that means :-) ) sounds like it would take into account that a newly released distro is expected to have a higher faked-errors-per-day value and "normalize" the displayed numbers over time, but that probably requires the speed of adaptation to new distro releases to be about the same for each cycle, which I doubt.

The way to reliably determine the number of machines with error reporting capability… is to count them, obviously, so the machines need to send an "I encountered no errors today but would have reported the error if I encountered one" message regularly. I don't think this is a privacy issue: as an experienced Ubuntu user, I expect the next crash report to be sent within ~10 days or so, and at that time my machine will be counted anyway. So just send that ping if error reporting is enabled.

If that's not possible, the popcon idea sounds okay, IMO. If popcon is a reliable measurement of the total install base, and the ratio "machines capable of sending error reports" / "total install base" changes rather slowly over time, then the number of machines capable of sending error reports can be calculated from the total install base even when there is a spike in the total install base (only if popcon is able to register that spike quickly, of course). Maybe the ratio error-reporting-machines / popcon-total-install-base could be calculated and updated regularly using the old 90-day method for a distro version that was released more than 90 days ago.
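A rough sketch of that calibration (the popcon figures and names are assumptions, purely illustrative):

    # Sketch only: calibrate the reporters/popcon ratio on a mature release,
    # where the 90-day count is trustworthy, then reuse it for a new release.
    def estimated_reporting_machines(popcon_new, popcon_mature, machines_90d_mature):
        ratio = machines_90d_mature / popcon_mature
        return popcon_new * ratio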
