Discussion:
Policy for disabling tests which run on TBPL
jmaher
2014-04-04 18:58:28 UTC
Permalink
As the sheriffs know, it is frustrating to deal with hundreds of tests that fail intermittently on a daily basis.

When a single test case is identified as leaking or failing at least 10% of the time, it is time to escalate.

Escalation path:
1) Ensure we have a bug on file, CCing the test author, reviewer, module owner, and any other interested parties, with links to logs, etc.
2) Needinfo? the test author and expect a response within 2 business days; make this expectation clear in a comment.
3) If we don't get a response, needinfo? the module owner,
with the expectation of a response within 2 days and of someone taking action.
4) If another 2 days pass with no response from the module owner, we will disable the test.

Ideally we will work with the test author to either get the test fixed or disabled depending on available time or difficulty in fixing the test.

This is intended to respect the time of the original test authors by not throwing emergencies in their lap, but also strike a balance with keeping the trees manageable.

Two exceptions:
1) If a test is failing at least 50% of the time, we will file a bug and disable the test first
2) When we are bringing a new platform online (Android 2.3, b2g, etc.), many tests will need to be disabled prior to getting the tests running on TBPL.

Please comment if this doesn't make sense.

-Joel
L. David Baron
2014-04-04 19:12:33 UTC
Permalink
Post by jmaher
As the sheriffs know, it is frustrating to deal with hundreds of tests that fail intermittently on a daily basis.
When a single test case is identified as leaking or failing at least 10% of the time, it is time to escalate.
1) Ensure we have a bug on file, CCing the test author, reviewer, module owner, and any other interested parties, with links to logs, etc.
2) Needinfo? the test author and expect a response within 2 business days; make this expectation clear in a comment.
3) If we don't get a response, needinfo? the module owner,
with the expectation of a response within 2 days and of someone taking action.
4) If another 2 days pass with no response from the module owner, we will disable the test.
Are you talking about newly-added tests, or tests that have been
passing for a long time and recently started failing?

In the latter case, the burden should fall on the regressing patch,
and the regressing patch should get backed out instead of disabling
the test.
Post by jmaher
Ideally we will work with the test author to either get the test fixed or disabled depending on available time or difficulty in fixing the test.
This is intended to respect the time of the original test authors by not throwing emergencies in their lap, but also strike a balance with keeping the trees manageable.
If this plan is applied to existing tests, then it will lead to
style system mochitests being turned off due to other regressions
because I'm the person who wrote them and the module owner, and I
don't always have time to deal with regressions in other parts of
code (e.g., the JS engine) leading to these tests failing
intermittently.

If that happens, we won't have the test coverage we need to add new
CSS properties or values.

More generally, it places a much heavier burden on contributors who
have been part of the project longer, who are also likely to be
overburdened in other ways (e.g., reviews). That's why the burden
needs to be placed on the regressing change rather than the original
author of the test.
Post by jmaher
1) If a test is failing at least 50% of the time, we will file a bug and disable the test first
These 10% and 50% numbers don't feel right to me; I think the
thresholds should probably be substantially lower. But I think it's
easier to think about these numbers in failures/day, at least for
me.
Post by jmaher
2) When we are bringing a new platform online (Android 2.3, b2g, etc.), many tests will need to be disabled prior to getting the tests running on TBPL.
That's reasonable as long as work is done to try to get the tests
enabled (at a minimum, actually enabling all the tests that are
passing reliably, rather than stopping after enabling the passing
tests in only some directories).

-David
--
𝄞 L. David Baron http://dbaron.org/ 𝄂
𝄢 Mozilla https://www.mozilla.org/ 𝄂
Before I built a wall I'd ask to know
What I was walling in or walling out,
And to whom I was like to give offense.
- Robert Frost, Mending Wall (1914)
Ehsan Akhgari
2014-04-04 19:44:10 UTC
Permalink
Post by L. David Baron
Post by jmaher
As the sheriffs know, it is frustrating to deal with hundreds of tests that fail intermittently on a daily basis.
When a single test case is identified as leaking or failing at least 10% of the time, it is time to escalate.
1) Ensure we have a bug on file, CCing the test author, reviewer, module owner, and any other interested parties, with links to logs, etc.
2) Needinfo? the test author and expect a response within 2 business days; make this expectation clear in a comment.
3) If we don't get a response, needinfo? the module owner,
with the expectation of a response within 2 days and of someone taking action.
4) If another 2 days pass with no response from the module owner, we will disable the test.
Are you talking about newly-added tests, or tests that have been
passing for a long time and recently started failing?
In the latter case, the burden should fall on the regressing patch,
and the regressing patch should get backed out instead of disabling
the test.
We have no good way of identifying the regressing patch in many cases.
Post by L. David Baron
Post by jmaher
Ideally we will work with the test author to either get the test fixed or disabled depending on available time or difficulty in fixing the test.
This is intended to respect the time of the original test authors by not throwing emergencies in their lap, but also strike a balance with keeping the trees manageable.
If this plan is applied to existing tests, then it will lead to
style system mochitests being turned off due to other regressions
because I'm the person who wrote them and the module owner, and I
don't always have time to deal with regressions in other parts of
code (e.g., the JS engine) leading to these tests failing
intermittently.
If that happens, we won't have the test coverage we need to add new
CSS properties or values.
More generally, it places a much heavier burden on contributors who
have been part of the project longer, who are also likely to be
overburdened in other ways (e.g., reviews). That's why the burden
needs to be placed on the regressing change rather than the original
author of the test.
Post by jmaher
1) If a test is failing at least 50% of the time, we will file a bug and disable the test first
These 10% and 50% numbers don't feel right to me; I think the
thresholds should probably be substantially lower. But I think it's
easier to think about these numbers in failures/day, at least for
me.
Post by jmaher
2) When we are bringing a new platform online (Android 2.3, b2g, etc.), many tests will need to be disabled prior to getting the tests running on TBPL.
That's reasonable as long as work is done to try to get the tests
enabled (at a minimum, actually enabling all the tests that are
passing reliably, rather than stopping after enabling the passing
tests in only some directories).
One thing I want to note with my "random guy who has been fixing tests
all around the code base for a few years" hat on: I think we
collectively have been terrible at acting on intermittent failures. I
see this over and over again where someone is needinfo?ed or assigned to
an orange bug and they *never* comment or interact with the bug in any
way. That is perhaps a cultural problem which we need to address, and
I'm sure in a lot of cases it is because the person in question is busy
with more important things, but it has one important implication: in a
lot of cases, the sheriffs basically have no way of knowing whether
anyone is ever going to work on the test failure in question. I really really
hate having to disable tests, but I'm entirely convinced that if we did
not do that very aggressively, the current tree would basically be
perma-orange. And that makes it impossible for anyone to check in anything.

So please try to keep that in mind when thinking about this proposal.

Cheers,
Ehsan
Gavin Sharp
2014-04-04 20:19:48 UTC
Permalink
Post by L. David Baron
Post by jmaher
1) Ensure we have a bug on file, CCing the test author, reviewer, module owner, and any other interested parties, with links to logs, etc.
2) Needinfo? the test author and expect a response within 2 business days; make this expectation clear in a comment.
3) If we don't get a response, needinfo? the module owner,
with the expectation of a response within 2 days and of someone taking action.
4) If another 2 days pass with no response from the module owner, we will disable the test.
In the latter case, the burden should fall on the regressing patch,
and the regressing patch should get backed out instead of disabling
the test.
The majority of the time, identifying the regressing patch is
difficult, and is something that requires help from someone who can debug
the test. Hence step 1). Steps 2) and 3) are an escalation path that
might result in disabling the test or fixing it, but the primary
purpose is to ensure that decision is reached with full visibility
from the relevant owners/developers.
Post by L. David Baron
If this plan is applied to existing tests, then it will lead to
style system mochitests being turned off due to other regressions
because I'm the person who wrote them and the module owner, and I
don't always have time to deal with regressions in other parts of
code (e.g., the JS engine) leading to these tests failing
intermittently.
That might be the right call. But that decision should not be reached
by sheriffs alone, which is occasionally the case now. This proposed
policy helps ensure that the module owner has at least an opportunity
to influence the decision about how to proceed with tests in their
module.

If a style system test is failing intermittently enough that it's a
significant burden on sheriffs, and you can't find someone to look
into it in a reasonable time frame, then disabling it is probably the
only option. The judgement call about "intermittently enough" and the
relative tradeoff of sheriff burden vs. test coverage is one that
needs to be made by a combination of you and sheriffs, and that's
really all that this policy is meant to enforce.

Gavin
Chris Peterson
2014-04-04 20:30:41 UTC
Permalink
Post by Gavin Sharp
The majority of the time identifying the regressing patch is
difficult
Identifying the regressing patch is only difficult because we have so
many intermittently failing tests.

Intermittent oranges are one of the major blockers for Autoland. If TBPL
never shows an entirely green test pass on mozilla-inbound, Autoland
can't programmatically determine when a checkin-needed patch can be safely
landed.


chris
Jonathan Griffin
2014-04-04 20:58:24 UTC
Permalink
With respect to Autoland, I think we'll need to figure out how to make
it take intermittents into account. I don't think we'll ever be in a state
with 0 intermittents.

Jonathan
Post by Chris Peterson
Post by Gavin Sharp
The majority of the time identifying the regressing patch is
difficult
Identifying the regressing patch is only difficult because we have so
many intermittently failing tests.
Intermittent oranges are one of the major blockers for Autoland. If
TBPL never shows an entirely green test pass on mozilla-inbound,
Autoland can't programmatically determine when a checkin-need patch
can be safely landed.
chris
Ehsan Akhgari
2014-04-04 21:02:30 UTC
Permalink
Post by Jonathan Griffin
With respect to Autoland, I think we'll need to figure out how to make
it take intermittents into account. I don't think we'll ever be in a state
with 0 intermittents.
That's not true; we were in that state once, before I stopped working on
this issue. We can get there again if we want to. It's just a lot of
hard work, which won't scale if we only have one person doing it.
Martin Thomson
2014-04-04 21:21:54 UTC
Permalink
That's not true, we were in that state once, before I stopped working on this issue. We can get there again if we wanted to. It's just a lot of hard work which won't scale if we only have one person doing it.
It’s self-correcting too. Turn on Autoland and require a green build to land a patch. I’m certain that the number of intermittent failures will drop dramatically.
Chris Peterson
2014-04-04 22:15:55 UTC
Permalink
Post by Martin Thomson
That's not true, we were in that state once, before I stopped working on this issue. We can get there again if we wanted to. It's just a lot of hard work which won't scale if we only have one person doing it.
It’s self-correcting too. Turn on auto land and require a green build to land a patch. I’m certain that the number of intermittent failures will drop dramatically.
Code in mozilla-central has passed a human sheriff's analysis of TBPL
results from mozilla-inbound. Someone could continuously run tests on a
good snapshot of mozilla-central (or the well-tested mozilla-release
branch) for 24 hours to determine the degree to which individual test cases
are solid green or shades of orange. Draw a line in the sand declaring
the solid greens as never intermittent and must-be-green on
mozilla-inbound. After fixing all the oranges, rinse and repeat with
roc's Chaos Mode enabled! :)

Note that this suggestion does not require mozilla-central to be frozen
from checkins.
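
To make the idea concrete, here is a rough sketch of that classification
pass (run_suite() is a hypothetical hook that runs the whole suite once
against the fixed snapshot and reports per-test results; the names are
illustrative only):

  from collections import Counter

  def classify(run_suite, runs=20, green_threshold=1.0):
      # Run the suite repeatedly against one snapshot and bucket each
      # test by its observed pass rate: solid green vs. shades of orange.
      passes, seen = Counter(), Counter()
      for _ in range(runs):
          for name, passed in run_suite().items():
              seen[name] += 1
              passes[name] += int(passed)
      solid_green = {t for t in seen if passes[t] / seen[t] >= green_threshold}
      return solid_green, set(seen) - solid_green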


chris
Ehsan Akhgari
2014-04-04 21:00:31 UTC
Permalink
Post by Chris Peterson
Post by Gavin Sharp
The majority of the time identifying the regressing patch is
difficult
Identifying the regressing patch is only difficult because we have so
many intermittently failing tests.
Intermittent oranges are one of the major blockers for Autoland. If TBPL
never shows an entirely green test pass on mozilla-inbound, Autoland
can't programmatically determine when a checkin-need patch can be safely
landed.
Note that that is only accurate to a certain point. There are other things
which we can do to guess our way out of the situation for Autoland,
but of course they're resource/time intensive (basically running orange
tests over and over again, etc.)
Aryeh Gregor
2014-04-06 12:59:41 UTC
Permalink
Note that that is only accurate to a certain point. There are other things which
we can do to guess our way out of the situation for Autoland, but of
course they're resource/time intensive (basically running orange tests over
and over again, etc.)
Is there any reason in principle that we couldn't have the test runner
automatically rerun tests with known intermittent failures a few
times, and let the test pass if it passes a few times in a row after
the first fail? This would be a much nicer option than disabling the
test entirely, and would still mean the test is mostly effective,
particularly if only specific failure messages are allowed to be
auto-retried.
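
To make that concrete, a minimal sketch of such a retry wrapper
(run_test() is a hypothetical harness hook and the failure patterns are
illustrative; this is not the actual mochitest API):

  import re

  # Hypothetical failure-message patterns that sheriffs have already
  # tied to known intermittent bugs; anything else stays a real failure.
  KNOWN_INTERMITTENT = [
      re.compile(r"application timed out"),
      re.compile(r"leaked \d+ window\(s\)"),
  ]

  def run_with_retries(run_test, test_path, retries=3, passes_needed=2):
      # First run: a pass is a pass, and an unknown failure stays orange.
      passed, message = run_test(test_path)
      if passed:
          return True, "passed"
      if not any(p.search(message) for p in KNOWN_INTERMITTENT):
          return False, message
      # Known-intermittent failure: retry and require consecutive passes.
      streak = 0
      for _ in range(retries):
          passed, message = run_test(test_path)
          streak = streak + 1 if passed else 0
          if streak >= passes_needed:
              return True, "passed on retry (known intermittent)"
      return False, message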
Ehsan Akhgari
2014-04-06 13:58:24 UTC
Permalink
Post by Aryeh Gregor
Note that that is only accurate to a certain point. There are other things which
we can do to guess our way out of the situation for Autoland, but of
course they're resource/time intensive (basically running orange tests over
and over again, etc.)
Is there any reason in principle that we couldn't have the test runner
automatically rerun tests with known intermittent failures a few
times, and let the test pass if it passes a few times in a row after
the first fail? This would be a much nicer option than disabling the
test entirely, and would still mean the test is mostly effective,
particularly if only specific failure messages are allowed to be
auto-retried.
I don't think there is a good reason not to do that. Please file a bug!

Cheers,
Ehsan
Ed Morley
2014-04-06 14:47:03 UTC
Permalink
Post by Ehsan Akhgari
Post by Aryeh Gregor
Is there any reason in principle that we couldn't have the test runner
automatically rerun tests with known intermittent failures a few
times, and let the test pass if it passes a few times in a row after
the first fail?
I don't think there is a good reason not to do that. Please file a bug!
xpcshell already does this; bug 918921 is filed for mochitests, though
dbaron raises valid concerns in the bug.

That said, I think something that would be worth doing (and is more
viable now that we're nearing completion of the switch to manifests for
tests) is having a 'mochitest-flaky' suite where tests that are marked in the
manifest as being unreliable are run. (Similar to what the Chromium
project currently does.)
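
As a rough illustration of the split (using a hypothetical 'flaky'
manifest key and the standard-library parser instead of manifestparser,
purely to show the idea):

  from configparser import ConfigParser

  def split_flaky(manifest_path):
      # Partition an .ini-style manifest into reliable tests and tests
      # annotated with a hypothetical "flaky = true" key; the reliable
      # set runs in the normal job, the rest in a mochitest-flaky job.
      cp = ConfigParser()
      cp.read(manifest_path)
      reliable, flaky = [], []
      for section in cp.sections():
          if cp.get(section, "flaky", fallback="false") == "true":
              flaky.append(section)
          else:
              reliable.append(section)
      return reliable, flaky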

Cheers,

Ed
Andrew Halberstadt
2014-04-07 03:33:09 UTC
Permalink
Post by Aryeh Gregor
Note that that is only accurate to a certain point. There are other things which
we can do to guess our way out of the situation for Autoland, but of
course they're resource/time intensive (basically running orange tests over
and over again, etc.)
Is there any reason in principle that we couldn't have the test runner
automatically rerun tests with known intermittent failures a few
times, and let the test pass if it passes a few times in a row after
the first fail? This would be a much nicer option than disabling the
test entirely, and would still mean the test is mostly effective,
particularly if only specific failure messages are allowed to be
auto-retried.
Many of our test runners have that ability. But doing this implies that
intermittents are always the fault of the test. We'd be missing whole
classes of regressions (notably race conditions).
James Graham
2014-04-07 09:10:07 UTC
Permalink
Post by Andrew Halberstadt
On Sat, Apr 5, 2014 at 12:00 AM, Ehsan Akhgari
Note that that is only accurate to a certain point. There are other things which
we can do to guess our way out of the situation for Autoland, but of
course they're resource/time intensive (basically running orange tests over
and over again, etc.)
Is there any reason in principle that we couldn't have the test runner
automatically rerun tests with known intermittent failures a few
times, and let the test pass if it passes a few times in a row after
the first fail? This would be a much nicer option than disabling the
test entirely, and would still mean the test is mostly effective,
particularly if only specific failure messages are allowed to be
auto-retried.
Many of our test runners have that ability. But doing this implies that
intermittents are always the fault of the test. We'd be missing whole
classes of regressions (notably race conditions).
In practice how effective are we at identifying bugs that lead to
instability? Is it more common that we end up disabling the test, or
marking it as "known intermittent" and learning to live with the
instability, both of which options reduce the test coverage, or is it
more common that we realise that there is a code reason for the
intermittent, and get it fixed?

If it is the latter then making the instability as obvious as possible
makes sense, and the current setup where we run each test once can be
regarded as a compromise between the ideal setup where we run each test
multiple times and flag it as a fail if it ever fails, and the needs of
performance.

If the former is true, it makes a lot more sense to do reruns of the
tests that fail in order to keep them active at all, and store
information about the fact that reruns occurred so that we can see when
a test started giving unexpected results. This does rely on having some
mechanism to make people care about genuine intermittents that they
caused, but maybe the right way to do that is to have some batch tool
that takes all the tests that have become intermittent, and does reruns
until it has identified the commits that introduced the intermittency,
and then files P1 bugs on the developer(s) it identifies to fix their bugs.
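
A rough sketch of such a batch tool, assuming a hypothetical
run_test_at() hook that builds and runs the flaky test once at a given
revision (the attempt count controls how confident we are that a "good"
revision really is good):

  def fails_intermittently(run_test_at, rev, attempts=20):
      # Any failure in `attempts` runs is taken as evidence that the
      # intermittency is already present at this revision.
      return any(not run_test_at(rev) for _ in range(attempts))

  def find_regressing_push(revisions, run_test_at, attempts=20):
      # Binary-search an ordered list of revisions (oldest first),
      # assuming revisions[0] is known good and revisions[-1] known bad.
      lo, hi = 0, len(revisions) - 1
      while lo + 1 < hi:
          mid = (lo + hi) // 2
          if fails_intermittently(run_test_at, revisions[mid], attempts):
              hi = mid
          else:
              lo = mid
      return revisions[hi]

The obvious cost is that a low failure rate needs a lot of attempts per
revision to avoid mis-classifying a bad revision as good.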
Aryeh Gregor
2014-04-07 11:53:57 UTC
Permalink
On Mon, Apr 7, 2014 at 6:33 AM, Andrew Halberstadt
Post by Andrew Halberstadt
Many of our test runners have that ability. But doing this implies that
intermittents are always the fault of the test. We'd be missing whole
classes of regressions (notably race conditions).
We already are, because we already will star (i.e., ignore) any
failure that looks like a known intermittent failure. I'm only saying
we should automate that as much as possible.
Andrew Halberstadt
2014-04-07 12:20:48 UTC
Permalink
Post by James Graham
Post by Andrew Halberstadt
Post by Aryeh Gregor
Is there any reason in principle that we couldn't have the test runner
automatically rerun tests with known intermittent failures a few
times, and let the test pass if it passes a few times in a row after
the first fail? This would be a much nicer option than disabling the
test entirely, and would still mean the test is mostly effective,
particularly if only specific failure messages are allowed to be
auto-retried.
Many of our test runners have that ability. But doing this implies that
intermittents are always the fault of the test. We'd be missing whole
classes of regressions (notably race conditions).
In practice how effective are we at identifying bugs that lead to
instability? Is it more common that we end up disabling the test, or
marking it as "known intermittent" and learning to live with the
instability, both of which options reduce the test coverage, or is it
more common that we realise that there is a code reason for the
intermittent, and get it fixed?
I would guess the former is true in most cases. But at least there we
have a *chance* of tracking down and fixing the failure, even if it
takes a while before it becomes annoying enough to prioritize. If we made
it so intermittents never annoyed anyone, there would be even less
motivation to fix them. Yes, in theory we would still have a list of top
failing intermittents. In practice that list will be ignored.

Case in point, desktop xpcshell does this right now. Open a log and
ctrl-f for "Retrying tests". Most runs have a few failures that got
retried. No one knows about these and no one looks at them. Publishing
results somewhere easy to discover would definitely help, but I'm not
convinced it will help much.

Doing this would also cause us to miss non-intermittent regressions, e.g.
where the ordering of tests tickles the platform the wrong way. On the
retry, the test would get run in a completely different order and might
show up green 100% of the time.

Either way, the problem is partly cultural, partly due to inadequate
tooling. I see where this proposal is coming from, but I think there are
ways of tackling the problem head-on. This seems kind of like a last resort.

Andrew
Aryeh Gregor
2014-04-07 13:02:21 UTC
Permalink
On Mon, Apr 7, 2014 at 3:20 PM, Andrew Halberstadt
I would guess the former is true in most cases. But at least there we have a
*chance* at tracking down and fixing the failure, even if it takes awhile
before it becomes annoying enough to prioritize. If we made it so
intermittents never annoyed anyone, there would be even less motivation to
fix them. Yes in theory we would still have a list of top failing
intermittents. In practice that list will be ignored.
Is this better or worse than the status quo? Just because a bug
happens to have made its way into our test suite doesn't mean it
should be high priority. If the bug isn't causing known problems for
users, it makes sense to ignore it in favor of working on bugs that
are known to affect users. Why not let the relevant developers make
that prioritization decision, and ignore the bug forever if they don't
think it's as important as other things they're working on?
Ted Mielczarek
2014-04-07 15:12:17 UTC
Permalink
Post by Aryeh Gregor
On Mon, Apr 7, 2014 at 3:20 PM, Andrew Halberstadt
I would guess the former is true in most cases. But at least there we have a
*chance* at tracking down and fixing the failure, even if it takes awhile
before it becomes annoying enough to prioritize. If we made it so
intermittents never annoyed anyone, there would be even less motivation to
fix them. Yes in theory we would still have a list of top failing
intermittents. In practice that list will be ignored.
Is this better or worse than the status quo? Just because a bug
happens to have made its way into our test suite doesn't mean it
should be high priority. If the bug isn't causing known problems for
users, it makes sense to ignore it in favor of working on bugs that
are known to affect users. Why not let the relevant developers make
that prioritization decision, and ignore the bug forever if they don't
think it's as important as other things they're working on?
If a bug is causing a test to fail intermittently, then that test loses
value. It still has some value in that it can catch regressions that
cause it to fail permanently, but we would not be able to catch a
regression that causes it to fail intermittently.

It's difficult to say whether bugs we find via tests are more or less
important than bugs we find via users. It's entirely possible that lots
of the bugs that cause intermittent test failures cause intermittent
weird behavior for our users; we simply don't have any visibility into
that.

-Ted
Aryeh Gregor
2014-04-07 15:49:58 UTC
Permalink
Post by Ted Mielczarek
If a bug is causing a test to fail intermittently, then that test loses
value. It still has some value in that it can catch regressions that
cause it to fail permanently, but we would not be able to catch a
regression that causes it to fail intermittently.
To some degree, yes, marking a test as expected intermittent causes it
to lose value. If the developers who work on the relevant component
think the lost value is important enough to track down the cause of
the intermittent failure, they can do so. That should be their
decision, not something forced on them by infrastructure issues
("everyone else will suffer if you don't find the cause for this
failure in your test"). Making known intermittent failures not turn
the tree orange doesn't stop anyone from fixing intermittent failures;
it just removes pressure from them if they decide they don't want to.
If most developers think they have more important bugs to fix, then I
don't see a problem with that.
Ehsan Akhgari
2014-04-07 23:41:13 UTC
Permalink
Post by Aryeh Gregor
Post by Ted Mielczarek
If a bug is causing a test to fail intermittently, then that test loses
value. It still has some value in that it can catch regressions that
cause it to fail permanently, but we would not be able to catch a
regression that causes it to fail intermittently.
To some degree, yes, marking a test as expected intermittent causes it
to lose value. If the developers who work on the relevant component
think the lost value is important enough to track down the cause of
the intermittent failure, they can do so. That should be their
decision, not something forced on them by infrastructure issues
("everyone else will suffer if you don't find the cause for this
failure in your test"). Making known intermittent failures not turn
the tree orange doesn't stop anyone from fixing intermittent failures,
it just removes pressure from them if they decide they don't want to.
If most developers think they have more important bugs to fix, then I
don't see a problem with that.
What you're saying above is true *if* someone investigates the
intermittent test failure and determines that the bug is not important.
But in my experience, that's not what happens at all. I think many
people treat intermittent test failures as a category of unimportant
problems, and therefore some bugs are never investigated. The fact of
the matter is that most of these bugs are bugs in our tests, which of
course will not impact our users directly, but I have occasionally come
across bugs in our code which are exposed as intermittent failures.
The real issue is that the work of identifying where the root of the
problem lies is often the majority of the work needed to fix the
intermittent test failure, so unless someone is willing to investigate
the bug, we cannot say whether or not it impacts our users.

The thing that really makes me care about these intermittent failures a
lot is that ultimately they force us to choose between disabling a
whole bunch of tests and being unable to manage our tree. As more and
more tests get disabled, we lose more and more test coverage, and that
can have a much more severe impact on the health of our products than
every individual intermittent test failure.

Cheers,
Ehsan
Aryeh Gregor
2014-04-08 12:15:35 UTC
Permalink
What you're saying above is true *if* someone investigates the intermittent
test failure and determines that the bug is not important. But in my
experience, that's not what happens at all. I think many people treat
intermittent test failures as a category of unimportant problems, and
therefore some bugs are never investigated. The fact of the matter is that
most of these bugs are bugs in our tests, which of course will not impact
our users directly, but I have occasionally come across bugs in our code
which are exposed as intermittent failures. The real issue is that the
work of identifying where the root of the problem lies is often the
majority of the work needed to fix the intermittent test failure, so unless
someone is willing to investigate the bug, we cannot say whether or not it
impacts our users.
The same is true for many bugs. The reported symptom might indicate a
much more extensive underlying problem. The fact is, though,
thoroughly investigating every bug would take a ton of resources, and
is almost certainly not the best use of our manpower. There are many
bugs that are *known* to affect many users that don't get fixed in a
timely fashion. Things that probably won't affect a single user ever
at all, and which are likely to be a pain to track down (because
they're intermittent), should be prioritized relatively low.
The thing that really makes me care about these intermittent failures a lot
is that ultimately they force us to choose between disabling a whole bunch
of tests and being unable to manage our tree. As more and more tests get
disabled, we lose more and more test coverage, and that can have a much more
severe impact on the health of our products than every individual
intermittent test failure.
I think you hit the nail on the head, but I think there's a third
solution: automatically ignore known intermittent failures, in as
fine-grained a way as possible. This means the test is still almost
as useful -- I think the vast majority of our tests will fail
consistently if the thing they're testing breaks, not fail
intermittently. But it doesn't get in the way of managing the tree.
Yes, it reduces some tests' value slightly relative to fixing them,
but it's not a good use of our resources to try tracking down most
intermittent failures. The status quo reduces those tests' value just
as much as automatic ignoring (because people will star known failure
patterns consistently), but imposes a large manual labor cost.
Ehsan Akhgari
2014-04-08 14:12:29 UTC
Permalink
Post by Aryeh Gregor
What you're saying above is true *if* someone investigates the intermittent
test failure and determines that the bug is not important. But in my
experience, that's not what happens at all. I think many people treat
intermittent test failures as a category of unimportant problems, and
therefore some bugs are never investigated. The fact of the matter is that
most of these bugs are bugs in our tests, which of course will not impact
our users directly, but I have occasionally come across bugs in our code
which are exposed as intermittent failures. The real issue is that the
work of identifying where the root of the problem lies is often the
majority of the work needed to fix the intermittent test failure, so unless
someone is willing to investigate the bug, we cannot say whether or not it
impacts our users.
The same is true for many bugs. The reported symptom might indicate a
much more extensive underlying problem. The fact is, though,
thoroughly investigating every bug would take a ton of resources, and
is almost certainly not the best use of our manpower. There are many
bugs that are *known* to affect many users that don't get fixed in a
timely fashion. Things that probably won't affect a single user ever
at all, and which are likely to be a pain to track down (because
they're intermittent), should be prioritized relatively low.
I don't think that an analogy with normal bugs is accurate here. These
intermittent failure bugs are categorically treated differently than all
other incoming bugs in my experience.
Post by Aryeh Gregor
The thing that really makes me care about these intermittent failures a lot
is that ultimately they force us to choose between disabling a whole bunch
of tests and being unable to manage our tree. As more and more tests get
disabled, we lose more and more test coverage, and that can have a much more
severe impact on the health of our products than every individual
intermittent test failure.
I think you hit the nail on the head, but I think there's a third
solution: automatically ignore known intermittent failures, in as
fine-grained a way as possible. This means the test is still almost
as useful -- I think the vast majority of our tests will fail
consistently if the thing they're testing breaks, not fail
intermittently. But it doesn't get in the way of managing the tree.
Yes, it reduces some tests' value slightly relative to fixing them,
but it's not a good use of our resources to try tracking down most
intermittent failures. The status quo reduces those tests' value just
as much as automatic ignoring (because people will star known failure
patterns consistently), but imposes a large manual labor cost.
I agree that automatically ignoring known intermittent failures that
have been marked as such by a human is a good idea.

But let's also not forget that it won't be a one-size-fits-all solution.
There are test failure scenarios such as timeouts and crashes which we
can't easily retry (for timeouts because the test may leave its
environment in a non-clean state). There will also be cases where
reloading the same test will actually test different things (because,
for example, things have been cached, etc.).

Cheers,
Ehsan
Karl Tomlinson
2014-04-08 22:10:08 UTC
Permalink
Post by Aryeh Gregor
Post by Ehsan Akhgari
What you're saying above is true *if* someone investigates the
intermittent test failure and determines that the bug is not
important. But in my experience, that's not what happens at
all. I think many people treat intermittent test failures as a
category of unimportant problems, and therefore some bugs are
never investigated. The fact of the matter is that most of
these bugs are bugs in our tests, which of course will not
impact our users directly, but I have occasionally come across
bugs in our code which are exposed as intermittent
failures. The real issue is that the work of identifying where
the root of the problem lies is often the majority of the work
needed to fix the intermittent test failure, so unless someone
is willing to investigate the bug, we cannot say whether or not
it impacts our users.
The same is true for many bugs. The reported symptom might
indicate a much more extensive underlying problem. The fact is,
though, thoroughly investigating every bug would take a ton of
resources, and is almost certainly not the best use of our
manpower. There are many bugs that are *known* to affect many
users that don't get fixed in a timely fashion. Things that
probably won't affect a single user ever at all, and which are
likely to be a pain to track down (because they're
intermittent), should be prioritized relatively low.
New intermittent failures are different from many user reported
bugs because they are known to be a regression and there is some
kind of indication of the regression window.

Regressions should be high priority. People are getting by
without many new features but people have begun to depend on
existing features, so regressions break real sites and cause
confusion for many people.

The time to address regressions is ASAP, so that responsibility
can be handed over to the person causing the regression. Waiting
too long means that backing out the cause of the regression is
likely to cause another regression.

I wonder whether the real problem here is that we have too many
bad tests that report false negatives, and these bad tests are
reducing the value of our testsuite in general. Tests also need
to be well documented so that people can understand what a
negative report really means. This is probably what is leading to
assumptions that disabling a test is the solution to a new
failure.

Getting bugs on file and seen by the right people is an important
part of dealing with this. The tricky part is working out how to
prioritize and cope with these bugs.
Ehsan Akhgari
2014-04-09 14:02:59 UTC
Permalink
Post by Karl Tomlinson
I wonder whether the real problem here is that we have too many
bad tests that report false negatives, and these bad tests are
reducing the value of our testsuite in general. Tests also need
to be well documented so that people can understand what a
negative report really means. This is probably what is leading to
assumptions that disabling a test is the solution to a new
failure.
I think this is a great point! Yes, we absolutely have a lot of bad
tests which are written in a way that is prone to intermittent failures.
This means that any number of changes to something entirely unrelated
to the test in question can change its environment to the point where a
latent intermittent failure may suddenly start to show up.
Sometimes, for tests that are not fixed in time, the
situation is even more confusing: a change in the environment causes it
to start to fail, then another change causes it to stop failing, and
then a while later another change may cause it to start failing again.

I think it would be very valuable to detect some of the known conditions
[1] that may result in a test intermittently failing because of timing
issues in our test harnesses to prevent from those badly written tests
from getting checked in in the first place. I've previously tried to do
some of this work [2], and it would be fantastic if someone can pick up
the ball there.

[1]
https://developer.mozilla.org/en-US/docs/Mozilla/QA/Avoiding_intermittent_oranges
[2] https://bugzilla.mozilla.org/show_bug.cgi?id=649012
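
To illustrate the kind of check I mean, here is a rough sketch that
flags one well-known pattern (relying on a fixed setTimeout delay
instead of waiting for an event); a real check would need a whitelist
for legitimate uses:

  import re, sys

  TIMEOUT_RE = re.compile(r"setTimeout\s*\(.*,\s*[1-9]\d*\s*\)")

  def check_file(path):
      # Flag setTimeout calls with a hard-coded nonzero delay in a test.
      with open(path, encoding="utf-8", errors="replace") as f:
          for lineno, line in enumerate(f, 1):
              if TIMEOUT_RE.search(line):
                  print("%s:%d: fixed setTimeout delay is prone to "
                        "intermittent failures" % (path, lineno))

  if __name__ == "__main__":
      for p in sys.argv[1:]:
          check_file(p)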

Cheers,
Ehsan
Mike Hoye
2014-04-07 15:53:55 UTC
Permalink
Post by Ted Mielczarek
It's difficult to say whether bugs we find via tests are more or less
important than bugs we find via users. It's entirely possible that
lots of the bugs that cause intermittent test failures cause
intermittent weird behavior for our users, we simply don't have any
visibility into that.
Is there a way - or could there be a way - for us to push builds that
generate intermittent test failures out to the larger Mozilla community?

And, more generally: I'd love it if we could economically tie in
community engagement to our automated test process somehow.

- mhoye
Andrew Halberstadt
2014-04-08 13:43:38 UTC
Permalink
Post by Aryeh Gregor
Post by Ted Mielczarek
If a bug is causing a test to fail intermittently, then that test loses
value. It still has some value in that it can catch regressions that
cause it to fail permanently, but we would not be able to catch a
regression that causes it to fail intermittently.
To some degree, yes, marking a test as expected intermittent causes it
to lose value. If the developers who work on the relevant component
think the lost value is important enough to track down the cause of
the intermittent failure, they can do so. That should be their
decision, not something forced on them by infrastructure issues
("everyone else will suffer if you don't find the cause for this
failure in your test"). Making known intermittent failures not turn
the tree orange doesn't stop anyone from fixing intermittent failures,
it just removes pressure from them if they decide they don't want to.
If most developers think they have more important bugs to fix, then I
don't see a problem with that.
I think this proposal would make more sense if the state of our
infrastructure and tooling was able to handle it properly. Right now,
automatically marking known intermittents would cause the test to lose
*all* value. It's sad, but the only data we have about intermittents
comes from the sheriffs manually starring them. There is also currently
no way to mark a test KNOWN-RANDOM and automatically detect if it starts
failing permanently. This means the failures can't be starred and become
nearly impossible to discover, let alone diagnose.

As I mentioned in another post in this thread, we need better data and
easier ways to drill through it. All I'm saying here is that I think
things are probably worse than you picture them, and I think there is a
lot of groundwork needed before it even makes sense to consider this.

-Andrew
James Graham
2014-04-08 13:51:21 UTC
Permalink
Post by Andrew Halberstadt
Post by Aryeh Gregor
Post by Ted Mielczarek
If a bug is causing a test to fail intermittently, then that test loses
value. It still has some value in that it can catch regressions that
cause it to fail permanently, but we would not be able to catch a
regression that causes it to fail intermittently.
To some degree, yes, marking a test as expected intermittent causes it
to lose value. If the developers who work on the relevant component
think the lost value is important enough to track down the cause of
the intermittent failure, they can do so. That should be their
decision, not something forced on them by infrastructure issues
("everyone else will suffer if you don't find the cause for this
failure in your test"). Making known intermittent failures not turn
the tree orange doesn't stop anyone from fixing intermittent failures,
it just removes pressure from them if they decide they don't want to.
If most developers think they have more important bugs to fix, then I
don't see a problem with that.
I think this proposal would make more sense if the state of our
infrastructure and tooling was able to handle it properly. Right now,
automatically marking known intermittents would cause the test to lose
*all* value. It's sad, but the only data we have about intermittents
comes from the sheriffs manually starring them. There is also currently
no way to mark a test KNOWN-RANDOM and automatically detect if it starts
failing permanently. This means the failures can't be starred and become
nearly impossible to discover, let alone diagnose.
So, what's the minimum level of infrastructure that you think would be
needed to go ahead with this plan? To me it seems like the current
system already isn't working very well, so the bar for moving forward
with a plan that would increase the amount of data we had available to
diagnose problems with intermittents, and reduce the amount of manual
labour needed in marking them, should be quite low.
Ehsan Akhgari
2014-04-08 14:06:55 UTC
Permalink
Post by James Graham
Post by Andrew Halberstadt
Post by Aryeh Gregor
Post by Ted Mielczarek
If a bug is causing a test to fail intermittently, then that test loses
value. It still has some value in that it can catch regressions that
cause it to fail permanently, but we would not be able to catch a
regression that causes it to fail intermittently.
To some degree, yes, marking a test as expected intermittent causes it
to lose value. If the developers who work on the relevant component
think the lost value is important enough to track down the cause of
the intermittent failure, they can do so. That should be their
decision, not something forced on them by infrastructure issues
("everyone else will suffer if you don't find the cause for this
failure in your test"). Making known intermittent failures not turn
the tree orange doesn't stop anyone from fixing intermittent failures,
it just removes pressure from them if they decide they don't want to.
If most developers think they have more important bugs to fix, then I
don't see a problem with that.
I think this proposal would make more sense if the state of our
infrastructure and tooling was able to handle it properly. Right now,
automatically marking known intermittents would cause the test to lose
*all* value. It's sad, but the only data we have about intermittents
comes from the sheriffs manually starring them. There is also currently
no way to mark a test KNOWN-RANDOM and automatically detect if it starts
failing permanently. This means the failures can't be starred and become
nearly impossible to discover, let alone diagnose.
So, what's the minimum level of infrastructure that you think would be
needed to go ahead with this plan? To me it seems like the current
system already isn't working very well, so the bar for moving forward
with a plan that would increase the amount of data we had available to
diagnose problems with intermittents, and reduce the amount of manual
labour needed in marking them, should be quite low.
dbaron raised the point that there are tests which are supposed to fail
intermittently if they detect a bug. With that in mind, the tests
cannot be marked as intermittently failing by the sheriffs, much less in
an automated way (see the discussion in bug 918921).

I'm still not convinced that this idea will be worse than the status
quo, but the fact that dbaron doesn't agree makes me hesitate.

But to answer your question, I think this is something which can be done
in the test harness itself so we don't need any special infra support
for it. Note that I don't think that automatically marking such tests
is a good idea either way.

Cheers,
Ehsan
James Graham
2014-04-08 14:32:35 UTC
Permalink
Post by Ehsan Akhgari
Post by James Graham
Post by Andrew Halberstadt
Post by Aryeh Gregor
Post by Ted Mielczarek
If a bug is causing a test to fail intermittently, then that test loses
value. It still has some value in that it can catch regressions that
cause it to fail permanently, but we would not be able to catch a
regression that causes it to fail intermittently.
To some degree, yes, marking a test as expected intermittent causes it
to lose value. If the developers who work on the relevant component
think the lost value is important enough to track down the cause of
the intermittent failure, they can do so. That should be their
decision, not something forced on them by infrastructure issues
("everyone else will suffer if you don't find the cause for this
failure in your test"). Making known intermittent failures not turn
the tree orange doesn't stop anyone from fixing intermittent failures,
it just removes pressure from them if they decide they don't want to.
If most developers think they have more important bugs to fix, then I
don't see a problem with that.
I think this proposal would make more sense if the state of our
infrastructure and tooling was able to handle it properly. Right now,
automatically marking known intermittents would cause the test to lose
*all* value. It's sad, but the only data we have about intermittents
comes from the sheriffs manually starring them. There is also currently
no way to mark a test KNOWN-RANDOM and automatically detect if it starts
failing permanently. This means the failures can't be starred and become
nearly impossible to discover, let alone diagnose.
So, what's the minimum level of infrastructure that you think would be
needed to go ahead with this plan? To me it seems like the current
system already isn't working very well, so the bar for moving forward
with a plan that would increase the amount of data we had available to
diagnose problems with intermittents, and reduce the amount of manual
labour needed in marking them, should be quite low.
dbaron raised the point that there are tests which are supposed to fail
intermittently if they detect a bug. With that in mind, the tests
cannot be marked as intermittently failing by the sheriffs, less so in
an automated way (see the discussion in bug 918921).
Such tests are problematic indeed, but it seems like they're problematic
in the current infrastructure too. For example if a test goes from
always passing to failing 1 time in 10 when it regresses, the first time
we see the regression is likely to be around 10 testruns after the
problem is introduced. That presumably makes it rather hard to track
down when things went wrong. Or are we running such tests N times
where N is some high enough number that we are confident that the test
has a 95% (or whatever) chance of failing if there is actually a
regression? If not, maybe we should be. Or perhaps the idea of
independent testruns isn't useful in the face of all the state we have.
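
For what it's worth, assuming independent runs that each fail with
probability p, the N needed for a given confidence is easy to compute:

  import math

  def runs_needed(failure_rate, confidence=0.95):
      # Smallest N with P(at least one failure in N runs) >= confidence.
      return math.ceil(math.log(1 - confidence) / math.log(1 - failure_rate))

  # A 1-in-10 intermittent needs about 29 runs for 95% confidence:
  # runs_needed(0.10) -> 29

so a test that regresses to failing 1 time in 10 would need roughly 29
runs per push to be caught with 95% confidence.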

In any case this kind of test could be explicitly excluded from the
reruns, which would make the situation the same as it is today.
Post by Ehsan Akhgari
But to answer your question, I think this is something which can be done
in the test harness itself so we don't need any special infra support
for it. Note that I don't think that automatically marking such tests
is a good idea either way.
The infra support I had in mind was something like "automatically (doing
something like) starring tests that only passed after being rerun" or
"listing all tests that needed a rerun" or "having a tool to find the
first build in which the test became intermittent". The goal of this
extra infrastructure would be to get the new information about reruns
out of the testharness and address the concern that doing automated
reruns would mean people paying even less attention to intermittents
than they do today.
L. David Baron
2014-04-08 17:20:10 UTC
Permalink
Post by James Graham
So, what's the minimum level of infrastructure that you think would
be needed to go ahead with this plan? To me it seems like the
current system already isn't working very well, so the bar for
moving forward with a plan that would increase the amount of data we
had available to diagnose problems with intermittents, and reduce
the amount of manual labour needed in marking them, should be quite
low.
Not sure what plan you're talking about, but:

The first step I'd like to see is having better tools for finding
where known intermittent failures regressed. In particular, we
should have:
* the ability to retrigger a partial test run (not the whole
suite) on our testing infrastructure. This doesn't always help,
since some failures will happen only in the context of the whole
suite, but I think it's likely to help most of the time.
* auto-bisection tools for intermittent failures that use the above
ability when they can

I think we're pretty good about backing out changesets that cause
new intermittent failures that happen at ~20% or more failure rates.
We need to get better about backing out for new intermittent
failures that are intermittent at lower rates, and being able to do
that is best done with better tools.


(One piece of context I'm coming from: there have been multiple
times that the tests that I consider necessary to have enabled to
allow people to add new CSS properties or values have failed
intermittently at a reasonably high rate for a few months; I think
both the start and end of these periods of failures have, in the
cases where we found it, correlated with major or minor changes to
the Javascript JIT. I think those JIT bugs, if they shipped, were
likely causing real problems for users, and we should be detecting
those bugs rather than disabling our CSS testing coverage and
putting us in a state where we can't add new CSS features.)


I also don't think that moving the failure threshold is a long-term
solution. There will always be tests that hover on the edge of
whatever the failure threshold is and give us trouble as a result; I
think moving the threshold will only give temporary relief due to
the history of writing tests to a stricter standard. For example,
if we retry intermittent failures up to 10 times to see if they
pass, we'll end up with tests that fail 75% of the time and thus
fail all 10 retries intermittently (5.6% of the time).
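
For reference, that 5.6% is just the chance that a test failing 75% of
the time also fails all 10 retries in a row:

  # 0.75 ** 10 is the probability of 10 consecutive failures at a
  # 75% failure rate:
  print(0.75 ** 10)   # ~0.0563, i.e. about 5.6%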

-David
--
𝄞 L. David Baron http://dbaron.org/ 𝄂
𝄢 Mozilla https://www.mozilla.org/ 𝄂
Before I built a wall I'd ask to know
What I was walling in or walling out,
And to whom I was like to give offense.
- Robert Frost, Mending Wall (1914)
Gavin Sharp
2014-04-08 18:41:05 UTC
Permalink
I see only two real goals for the proposed policy:
- ensure that module owners/peers have the opportunity to object to
any "disable test" decisions before they take effect
- set an expectation that intermittent orange failures are dealt with
promptly ("dealt with" first involves investigation, usually by a
developer familiar with the code, and can then lead to either them
being fixed, disabled, or ignored)

Neither of those happens reliably today. Sheriffs are failing to get
the help they need to investigate failures, which leads to loss of
(sometimes quite important) test coverage when they decide to
unilaterally disable the relevant tests. Sheriffs should not be
disabling tests unilaterally; developers should not be ignoring
sheriff requests to investigate failures.

The policy is not intended to suggest that any particular outcome
(i.e. test disabling) is required.

Separately from all of that, we could definitely invest in better
tools for dealing with intermittent failures in general. Anecdotally,
I know Chromium has some nice ways of dealing with them, for example.
But I see that as a separate discussion not really related to the goals
above.

Gavin
Post by L. David Baron
Post by James Graham
So, what's the minimum level of infrastructure that you think would
be needed to go ahead with this plan? To me it seems like the
current system already isn't working very well, so the bar for
moving forward with a plan that would increase the amount of data we
had available to diagnose problems with intermittents, and reduce
the amount of manual labour needed in marking them, should be quite
low.
The first step I'd like to see is having better tools for finding
where known intermittent failures regressed. In particular, we
* the ability to retrigger a partial test run (not the whole
suite) on our testing infrastructure. This doesn't always help,
since some failures will happen only in the context of the whole
suite, but I think it's likely to help most of the time.
* auto-bisection tools for intermittent failures that use the above
ability when they can
I think we're pretty good about backing out changesets that cause
new intermittent failures that happen at ~20% or more failure rates.
We need to get better about backing out for new intermittent
failures that are intermittent at lower rates, and being able to do
that is best done with better tools.
(One piece of context I'm coming from: there have been multiple
times that the tests that I consider necessary to have enabled to
allow people to add new CSS properties or values have failed
intermittently at a reasonably high rate for a few months; I think
both the start and end of these periods of failures have, in the
cases where we found it, correlated with major or minor changes to
the Javascript JIT. I think those JIT bugs, if they shipped, were
likely causing real problems for users, and we should be detecting
those bugs rather than disabling our CSS testing coverage and
putting us in a state where we can't add new CSS features.)
I also don't think that moving the failure threshold is a long-term
solution. There will always be tests that hover on the edge of
whatever the failure threshold is and give us trouble as a result; I
think moving the threshold will only give temporary relief due to
the history of writing tests to a stricter standard. For example,
if we retry intermittent failures up to 10 times to see if they
pass, we'll end up with tests that fail 75% of the time and thus
fail all 10 retries intermittently (5.6% of the time).
-David
--
𝄞 L. David Baron http://dbaron.org/ 𝄂
𝄢 Mozilla https://www.mozilla.org/ 𝄂
Before I built a wall I'd ask to know
What I was walling in or walling out,
And to whom I was like to give offense.
- Robert Frost, Mending Wall (1914)
L. David Baron
2014-04-08 18:49:00 UTC
Permalink
Post by Gavin Sharp
- ensure that module owners/peers have the opportunity to object to
any "disable test" decisions before they take effect
- set an expectation that intermittent orange failures are dealt with
promptly ("dealt with" first involves investigation, usually by a
developer familiar with the code, and can then lead to either them
being fixed, disabled, or ignored)
I'm fine with the initial policy proposed at the top of the thread;
this part of the subthread seemed to be about a proposal to
auto-retry failing tests and report them as passing if they
intermittently pass; that's the bit I'm not comfortable with.

-David
--
𝄞 L. David Baron http://dbaron.org/ 𝄂
𝄢 Mozilla https://www.mozilla.org/ 𝄂
Before I built a wall I'd ask to know
What I was walling in or walling out,
And to whom I was like to give offense.
- Robert Frost, Mending Wall (1914)
Chris Peterson
2014-04-08 19:15:17 UTC
Permalink
Post by Gavin Sharp
Separately from all of that, we could definitely invest in better
tools for dealing with intermittent failures in general. Anecdotally,
I know chromium has some nice ways of dealing with them, for example.
But I see that a separate discussion not really related to the goals
above.
Is fixing the known intermittent failures part of the plan? :)

Many of the known failures are test timeouts, which suggests some
low-hanging fruit in fixing test or network infrastructure problems:

http://brasstacks.mozilla.com/orangefactor/

chris
Ehsan Akhgari
2014-04-08 19:43:09 UTC
Permalink
Post by Chris Peterson
Post by Gavin Sharp
Separately from all of that, we could definitely invest in better
tools for dealing with intermittent failures in general. Anecdotally,
I know chromium has some nice ways of dealing with them, for example.
But I see that a separate discussion not really related to the goals
above.
Is fixing the known intermittent failures part of the plan? :)
Many of the known failures are test timeouts, which suggests some
http://brasstacks.mozilla.com/orangefactor/
Fixing intermittent timeouts is neither easier nor harder than any other
kind of intermittent failure.

Cheers,
Ehsan
Gregory Szorc
2014-04-09 18:00:56 UTC
Permalink
Post by James Graham
Post by Andrew Halberstadt
Post by Aryeh Gregor
Post by Ted Mielczarek
If a bug is causing a test to fail intermittently, then that test loses
value. It still has some value in that it can catch regressions that
cause it to fail permanently, but we would not be able to catch a
regression that causes it to fail intermittently.
To some degree, yes, marking a test as expected intermittent causes it
to lose value. If the developers who work on the relevant component
think the lost value is important enough to track down the cause of
the intermittent failure, they can do so. That should be their
decision, not something forced on them by infrastructure issues
("everyone else will suffer if you don't find the cause for this
failure in your test"). Making known intermittent failures not turn
the tree orange doesn't stop anyone from fixing intermittent failures,
it just removes pressure from them if they decide they don't want to.
If most developers think they have more important bugs to fix, then I
don't see a problem with that.
I think this proposal would make more sense if the state of our
infrastructure and tooling was able to handle it properly. Right now,
automatically marking known intermittents would cause the test to lose
*all* value. It's sad, but the only data we have about intermittents
comes from the sheriffs manually starring them. There is also currently
no way to mark a test KNOWN-RANDOM and automatically detect if it starts
failing permanently. This means the failures can't be starred and become
nearly impossible to discover, let alone diagnose.
So, what's the minimum level of infrastructure that you think would be
needed to go ahead with this plan? To me it seems like the current
system already isn't working very well, so the bar for moving forward
with a plan that would increase the amount of data we had available to
diagnose problems with intermittents, and reduce the amount of manual
labour needed in marking them, should be quite low.
The simple solution is to have a separate in-tree manifest annotation
for intermittents. Put another way, we can describe exactly why we are
not running a test. This is kinda/sorta the realm of bug 922581.
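As a purely illustrative sketch of what that could look like (the
"intermittent" key below is invented for this example, not an existing
manifest keyword), the annotation could sit next to the existing
skip-if style entries, and a harness could distinguish "not compatible"
from "not reliable" with a few lines of Python:

  # Hypothetical manifest entry -- the "intermittent" key is made up:
  #
  #   [test_example.html]
  #   skip-if = os == 'android'    # existing style: conditionally skipped
  #   intermittent = bug 987654    # hypothetical: known flaky, still run
  #
  from configparser import ConfigParser

  def classify(manifest_path):
      parser = ConfigParser(interpolation=None)
      parser.read(manifest_path)
      classified = {}
      for test in parser.sections():
          if parser.has_option(test, "skip-if"):
              classified[test] = "may be skipped (compatibility)"
          elif parser.has_option(test, "intermittent"):
              classified[test] = "known intermittent"
          else:
              classified[test] = "expected to pass"
      return classified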

The harder solution is to have some service (like orange factor) keep
track of the state of every test. We can have a feedback loop whereby
test automation queries that service to see what tests should run and
what the expected result is. Of course, we will want that integration to
work locally so we have consistent test execution between automation and
developer machines.

I see us inevitably deploying the harder solution. We'll eventually get
to a point where we're able to do "crazy" things such as intelligently
run only the subset of tests impacted by a check-in or attempt to
run disabled tests to see if they magically started working again. I
think we'll eventually realize that tracking this in a central service
makes more sense than doing it in-tree (mainly because of the amount of
data required to make some advanced determinations).

For the short term, I think we should enumerate the reasons we don't run
a test (distinguishing between "test isn't compatible" and "test isn't
working" is important) and annotate these separately in our test
manifests. We can then modify our test automation to treat things
differently. For example, we could:

1) Run failed tests multiple times. If it is intermittent but not marked
as such, we fail the test run.
2) Run marked intermittent tests multiple times. If it works all 25
times, fail the test run for inconsistent metadata.
3) Integrate intermittent failures into TBPL/Orange Factor better.
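To make options 1 and 2 concrete, here is a minimal sketch (purely
illustrative; run_once stands in for however the real harness executes
a single test, and max_runs is a placeholder like the "25" above):

  def evaluate(run_once, test_path, marked_intermittent, max_runs=5):
      # run_once(test_path) -> True (pass) / False (fail)
      first = run_once(test_path)
      if marked_intermittent:
          reruns = [first] + [run_once(test_path)
                              for _ in range(max_runs - 1)]
          if all(reruns):
              # Option 2: a marked test that always passes has stale
              # metadata, so fail the run to force a cleanup.
              return "fail: inconsistent metadata"
          return "pass"
      if first:
          return "pass"
      reruns = [run_once(test_path) for _ in range(max_runs - 1)]
      if any(reruns):
          # Option 1: an unmarked test that flickers fails the run.
          return "fail: unmarked intermittent"
      return "fail"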

To address David Baron's concern about silently passing intermittently
failing tests, yes, silently passing is wrong. But I would argue it is
the lesser evil compared to disabling tests outright.

I think we can all agree that the current approach of disabling failing
tests (the equivalent of sweeping dust under the rug) isn't sustainable.
But if it's the sheriffs' job to keep the trees green and their only
available recourse is to disable tests, well, they are going to disable
tests. We need more metadata and tooling around disabled tests and we
needed it months ago.
L. David Baron
2014-04-09 18:29:25 UTC
Permalink
Post by Gregory Szorc
The simple solution is to have a separate in-tree manifest
annotation for intermittents. Put another way, we can describe
exactly why we are not running a test. This is kinda/sorta the realm
of bug 922581.
The harder solution is to have some service (like orange factor)
keep track of the state of every test. We can have a feedback loop
whereby test automation queries that service to see what tests
should run and what the expected result is. Of course, we will want
that integration to work locally so we have consistent test
execution between automation and developer machines.
I think both of these are bad.

It should be visible near the tests whether they are running or not,
rather than out of band, so that module owners and those working on
the code are aware of the testing coverage.

Annotating something as intermittent is halfway to disabling, and
should thus be reviewed by test authors / module owners just like
disabling should be; it sounds like you're proposing changing that
as well.

The latter solution also breaks being able to describe a test run
with reference to a revision in the VCS repository.

-David
--
𝄞 L. David Baron http://dbaron.org/ 𝄂
𝄢 Mozilla https://www.mozilla.org/ 𝄂
Before I built a wall I'd ask to know
What I was walling in or walling out,
And to whom I was like to give offense.
- Robert Frost, Mending Wall (1914)
Gregory Szorc
2014-04-09 18:48:02 UTC
Permalink
Post by L. David Baron
Post by Gregory Szorc
The simple solution is to have a separate in-tree manifest
annotation for intermittents. Put another way, we can describe
exactly why we are not running a test. This is kinda/sorta the realm
of bug 922581.
The harder solution is to have some service (like orange factor)
keep track of the state of every test. We can have a feedback loop
whereby test automation queries that service to see what tests
should run and what the expected result is. Of course, we will want
that integration to work locally so we have consistent test
execution between automation and developer machines.
I think both of these are bad.
It should be visible near the tests whether they are running or not,
rather than out of band, so that module owners and those working on
the code are aware of the testing coverage.
I would love this all to be in band and part of version control (in the
test manifests). That's why I propose putting things in test manifests
today and for the foreseeable future.

I just think we'll ultimately reach a stage where we want to leverage
"big data" for more intelligently running tests. We'll cross that bridge
when we get to it, I reckon.
Post by L. David Baron
Annotating something as intermittent is halfway to disabling, and
should thus be reviewed by test authors / module owners just like
disabling should be; it sounds like you're proposing changing that
as well.
Absolutely not! I am very disappointed with the current dynamic between
sheriffs and module owners and test authors because disabling tests is
leading to worse test coverage and opening ourselves up to all kinds of
risks. I'd like to think test authors and module owners should have the
last word. But we have been doing a pretty crappy job of fixing our
broken tests. I feel a lot of people just shrug their shoulders and allow
the test to be disabled (I'm guilty of it as much as anyone). From my
perspective, it's difficult to convince the powers that be that fixing
intermittent failures (that have been successfully swept under a rug and
are out of sight and out of mind) is more important than implementing
some shiny new feature (that shows up on an official goals list). I feel
we all need to treat failing tests with more urgency. The engineering
culture is not currently favoring that.
Post by L. David Baron
The latter solution also breaks being able to describe a test run
with reference to a revision in the VCS repository.
Not necessarily. It would add complexity to the service, but if captured
as a requirement, it is certainly doable.
Chris Peterson
2014-04-09 22:46:05 UTC
Permalink
I feel a lot of people just shrug shoulders and allow the test to be
disabled (I'm guilty of it as much as anyone). From my perspective, it's
difficult to convince the powers at be that fixing intermittent failures
(that have been successfully swept under a rug and are out of sight and
out of mind) is more important than implementing some shiny new feature
(that shows up on an official goals list). I feel we all need to treat
failing tests with more urgency. The engineering culture is not
currently favoring that.
How can we realign our testing priorities, short of some management
decree? Sheriffs have had to disable entire test suites, but even that
has not motivated test owners to fix their intermittent tests. Part of
the problem, I understand, is that many tests have no owner, so the
responsibility for even debugging test failures is not clear.

How much effort do we want to spend creating tools to manage and rerun
known intermittent tests compared to diagnosing the root problems?
Earlier in this thread, Ehsan said that, once upon a time, we had no
intermittent tests. Someone else suggested (half-jokingly, I assume :)
that requiring a 100% green TBPL before landing on inbound would be an
effective motivator to fix intermittent tests.


chris
Ehsan Akhgari
2014-04-09 23:32:03 UTC
Permalink
Post by Chris Peterson
I feel a lot of people just shrug shoulders and allow the test to be
disabled (I'm guilty of it as much as anyone). From my perspective, it's
difficult to convince the powers at be that fixing intermittent failures
(that have been successfully swept under a rug and are out of sight and
out of mind) is more important than implementing some shiny new feature
(that shows up on an official goals list). I feel we all need to treat
failing tests with more urgency. The engineering culture is not
currently favoring that.
How can we realign our testing priorities, short of some management
decree?
Speaking from my experience on this over the years, I don't think we can
(and we're working on that!)
Post by Chris Peterson
Sheriffs have had to disable entire test suites, but even that
has not motivated test owners to fix their intermittent tests. Part of
the problem, I understand, is that many tests have no owner, so the
responsibility for even debugging test failures is not clear.
How much effort do we want to spend creating tools to manage and rerun
known intermittent tests compared to diagnosing the root problems?
Earlier in this thread, Ehsan said that, once upon a time, we had no
intermittent tests. Someone else suggested (half-jokingly, I assume :)
that requiring a 100% green TBPL before landing on inbound would be an
effective motivator to fix intermittent tests.
I am not convinced we need many additional tools at this point. All of
the tools in the world won't help unless we have people using them!
Without that, I think discussing what tools to build might be premature.

Cheers,
Ehsan
jmaher
2014-04-15 13:21:31 UTC
Permalink
I want to express my thanks to everyone who contributed to this thread. We have a lot of passionate and smart people who care about this topic; thanks again for weighing in so far.

Below is a slightly updated policy from the original, and following that is an attempt to summarize the thread and turn what makes sense into actionable items.

= Policy for handling intermittent oranges =

This policy will define an escalation path for when a single test case is identified to be leaking or failing and is causing enough disruption on the trees. Disruption is defined as:
1) Test case is on the list of top 20 intermittent failures on Orange Factor (http://brasstacks.mozilla.com/orangefactor/index.html)
2) It is causing oranges >=8% of the time
3) We have >100 instances of this failure in the bug in the last 30 days

Escalation is a responsibility of all developers, although the majority will fall on the sheriffs.

Escalation path:
1) Ensure we have a bug on file, with the test author, reviewer, module owner, and any other interested parties, links to logs, etc.
2) We need to needinfo? and expect a response within 2 business days, this should be clear in a comment.
3) In the case we don't get a response, request a needinfo? from the module owner
with the expectation of 2 days for a response and getting someone to take action.
4) In the case we go another 2 days with no response from a module owner, we will disable the test.

Ideally we will work with the test author to either get the test fixed or disabled depending on available time or difficulty in fixing the test. If a bug has activity and work is being done to address the issue, it is reasonable to expect the test will not be disabled. Inactivity in the bug is the main cause for escalation.

This is intended to respect the time of the original test authors by not throwing emergencies in their lap, but also strike a balance with keeping the trees manageable.

Exceptions:
1) If this test has landed (or been modified) in the last 48 hours, we will most likely back out the patch with the test
2) If a test is failing at least 30% of the time, we will file a bug and disable the test first
3) When we are bringing a new platform online (Android 2.3, b2g, etc.) many tests will need to be disabled prior to getting the tests on tbpl.
4) In the rare case we are disabling the majority of the tests (either at once or slowly over time) for a given feature, we need to get the module owner to sign off on the current state of the tests.


= Documentation =
We have thousands of tests disabled; many are disabled only for specific build configurations or platforms. This can be dangerous as we slowly reduce our coverage. By running a daily report (bug 996183) outlining the total tests available versus those enabled for each configuration (b2g, debug, osx, e10s, etc.), we can bring visibility to the state of each platform and to whether we are disabling more tests than we fix.
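A minimal sketch of the kind of counting such a report could do
(purely illustrative; it assumes .ini-style manifests and that disabled
entries carry a "disabled" key, and a real report would also break the
counts down per configuration):

  import os
  from configparser import ConfigParser

  def coverage_report(topdir):
      # Count manifest entries with and without a "disabled" key.
      enabled = disabled = 0
      for root, _dirs, files in os.walk(topdir):
          for name in files:
              if not name.endswith(".ini"):
                  continue
              parser = ConfigParser(interpolation=None)
              try:
                  parser.read(os.path.join(root, name))
              except Exception:
                  continue  # skip manifests this naive parser can't handle
              for test in parser.sections():
                  if parser.has_option(test, "disabled"):
                      disabled += 1
                  else:
                      enabled += 1
      return enabled, disabled

  enabled, disabled = coverage_report(".")
  print("enabled: %d  disabled: %d" % (enabled, disabled))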

We need to have a clear guide on how to run the tests, how to write a test, how to debug a test, and how to use metadata to indicate whether we have looked at a given test and when.

When an intermittent bug is filed, we need to clearly outline what information will aid the most in reproducing and fixing this bug. Without a documented process for fixing oranges, this falls on the shoulders of the original test authors and a few determined hackers.


= General Policy =
I have adjusted the above policy to mention backing out new tests which are not stable, working to identify a regression in the code or tests, and adding protection so we do not disable coverage for a specific feature completely. In addition, I added a clearer definition of what is a disruptive test and clarified the expectations around communicating in the bug vs escalating.

What is more important is the culture we have around committing patches to Mozilla repositories. We need to decide as an organization if we care about zero oranges (or insert acceptable percentage). We also need to decide what acceptable coverage levels are and what our general policy is for test reviews (at checkin time and in the future). These need to be answered outside of this policy, but the sooner we answer these questions, the better we can all move forward towards the same goal.


= Tools =
Much of the discussion was around tools. As a member of the Automation and Tools team, I should be advocating for more tools, but in this case I am leaning more towards fewer tools and better process.

One common problem is dealing with the noise around infrastructure and changing environments and test harnesses. Is this documented, and how can we filter it out? Having our tools support ways to detect this and to annotate changes unrelated to tests or builds will go a long way. Related work is updating our harnesses and the way we run tests so they are more repeatable. I have filed bug 996504 to track work on this.

Another problem we can look at with tooling is annotating the expected outcome of the tests with metadata (suggestions included both in-tree manifests and an external server). Once we get there, we have options such as:
* rerunning tests (until they pass, or to document failure patterns)
* putting all oranges in their own suite
* ignoring results of known oranges

Of course no discussion would be complete without talking about what we could do if this problem were solved. Honorable mentions are:
* Autoland
* Orange Factor / Test Statistics
* Auto Bisection


Happy hacking,
Joel
Kyle Huey
2014-04-15 13:42:25 UTC
Permalink
Post by jmaher
1) Test case is on the list of top 20 intermittent failures on Orange Factor (http://brasstacks.mozilla.com/orangefactor/index.html)
2) It is causing oranges >=8% of the time
3) We have >100 instances of this failure in the bug in the last 30 days
Are these conditions joined by 'and' or by 'or'? If 'or', there will
always be at least 20 tests meeting this set of criteria ...

- Kyle
jmaher
2014-04-15 13:51:04 UTC
Permalink
Post by Kyle Huey
Post by jmaher
1) Test case is on the list of top 20 intermittent failures on Orange Factor (http://brasstacks.mozilla.com/orangefactor/index.html)
2) It is causing oranges >=8% of the time
3) We have >100 instances of this failure in the bug in the last 30 days
Are these conditions joined by 'and' or by 'or'? If 'or', there will
always be at least 20 tests meeting this set of criteria ...
- Kyle
Great question Kyle-

The top 20 doesn't always include specific tests; sometimes the entries are related to infrastructure, hardware/VMs, or the test harness, mozharness, etc. If a test meets any of the above criteria and is escalated, then we should expect to either work on fixing it or disable it, as spelled out in the escalation path.

For the large majority of cases, bugs filed for specific test cases will meet all 3 conditions. We have had some cases where a bug has thousands of stars over the years but isn't on the top 20 list all the time. Likewise, when we have 10 infra bugs, a frequent orange on the trees won't be in the top 20.

-Joel
Karl Tomlinson
2014-04-16 00:06:39 UTC
Permalink
Thank you for putting this together. It is important.
Post by jmaher
This policy will define an escalation path for when a single test case is
identified to be leaking or failing and is causing enough disruption on the
trees.
1) If this test has landed (or been modified) in the last 48 hours, we will
most likely back out the patch with the test
2) If a test is failing at least 30% of the time, we will file a bug and
disable the test first
I have adjusted the above policy to mention backing out new
tests which are not stable, working to identify a regression in
the code or tests,
I see the exception for regressions from test changes, but I
didn't notice mention of regressions from code.

If a test has started failing intermittently and is failing at
least 30% of the time, then I expect it is not difficult to
identify the trigger for the regression. It is much more
difficult to identify an increase in frequency of failure.

Could we make it clear that the preferred solution is to back out the
cause of the regression? Either say something in the opening
paragraph or perhaps add the exception:

"If a regressing push or series of pushes can be identified then
the changesets in those pushes are reverted."

We won't always have a single push because other errors preventing
tests from running are often not backed out until after several
subsequent pushes.
jmaher
2014-04-18 19:08:12 UTC
Permalink
I have made small edits thanks to Kyle and Karl; the official policy is posted on the Sheriffing wiki page:
https://wiki.mozilla.org/Sheriffing/Test_Disabling_Policy

Karl Tomlinson
2014-04-09 21:07:17 UTC
Permalink
Post by Gregory Szorc
2) Run marked intermittent tests multiple times. If it works all
25 times, fail the test run for inconsistent metadata.
We need to consider intermittently failing tests as failed, and we
need to only test things that always pass.

We can't rely on statistics to tell us about changes in test
results without huge expense. If we have many intermittent tests
running many times, then sometimes they will fail in all 25 runs
and sometimes they will pass in all 25 runs.
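A quick illustrative calculation of that point (the 25-run count and
the failure rates are just examples):

  # Chance that a test with per-run failure rate p fails all n runs,
  # or passes all n runs.
  n = 25
  for p in (0.10, 0.50, 0.90):
      print(p, round(p ** n, 3), round((1 - p) ** n, 3))
  # A test failing 10% of the time still passes all 25 runs about 7%
  # of the time, and a test failing 90% of the time still fails all
  # 25 runs about 7% of the time, so a single batch of reruns can
  # easily misclassify either one.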
Gregory Szorc
2014-04-09 21:29:20 UTC
Permalink
Post by Karl Tomlinson
Post by Gregory Szorc
2) Run marked intermittent tests multiple times. If it works all
25 times, fail the test run for inconsistent metadata.
We need to consider intermittently failing tests as failed, and we
need to only test things that always pass.
I strongly reject the second part of that sentence. That world leads to
a state where we know less and don't test as much. It throws away a huge
amount of valuable data.
Post by Karl Tomlinson
We can't rely on statistics to tell us about changes in test
results without huge expense. If we have many intermittent tests
running many times, then sometimes they will fail in all 25 runs
and sometimes they will pass in all 25 runs.
I threw out 25 as a placeholder. We could have metadata differentiate
between tests that are expected to fail 1% or 50% and treat them
appropriately.

If you are concerned about costs of testing, I submit to you Microsoft's
costs for not adequately testing the browser choice screen: a €561
million fine [1]. I don't want to be the person explaining to the
lawyers that we disabled a test that was verifying our legal compliance
with a law or contract because it failed intermittently 2% of the time.

[1] http://europa.eu/rapid/press-release_IP-13-196_en.htm
Karl Tomlinson
2014-04-07 02:26:17 UTC
Permalink
Post by jmaher
overburdened in other ways (e.g., reviews). The burden
needs to be placed on the regressing change rather than the original
author of the test.
I am open to ideas to help figure out the offending changes. My
understanding is many of the test failures are due to small
adjustments to the system or even the order the tests are run
such that it causes the test to fail intermittently.
In the latter case, the burden should fall on the regressing patch,
and the regressing patch should get backed out instead of disabling
the test.
The majority of the time identifying the regressing patch is
difficult, and something that requires help from someone who can debug
the test. Hence step 1). Steps 2) and 3) are an escalation path that
might result in disabling the test or fixing it, but the primary
purpose is to ensure that decision is reached with full visibility
from the relevant owners/developers.
Often enough reasonable effort does identify the cause of the regression.

For tests failing more than 20% of the time, it should not be too
difficult to identify the regressing push through retriggers.
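As a rough illustration of why that is (the 20% rate and 20
retriggers are just example numbers, not policy):

  # With per-run failure rate p, the chance that k retriggers on a
  # single push show at least one failure is 1 - (1 - p) ** k.
  p, k = 0.20, 20
  print(round(1 - (1 - p) ** k, 3))  # ~0.988

  # So roughly twenty retriggers per suspect push are usually enough
  # to tell a "before" push from an "after" push at that rate.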

The key question is: who bears this responsibility?
David points out that authors and module owners can't keep up with
this. How much investigation is required before the
responsibility passes to the author or owner? I'd like to see
this addressed in a policy.

Sometimes, even after the regressing changes are identified, there
is no response from the author or reviewer of the regressing
changes. A backout is the right action here, but this is not
always what happens.

A policy is important to ensure that there is some path through
even when there is no response from the author of the regressing
change, and then no response from the test author or owner.
Andrew Halberstadt
2014-04-07 03:53:14 UTC
Permalink
Post by Ehsan Akhgari
Post by L. David Baron
Are you talking about newly-added tests, or tests that have been
passing for a long time and recently started failing?
In the latter case, the burden should fall on the regressing patch,
and the regressing patch should get backed out instead of disabling
the test.
We have no good way of identifying the regressing patch in many cases.
In cases where the regression isn't caught right away, orangefactor is
currently the only tool at our disposal. While it is great for
identifying priorities to fix, it is less great at drilling down through
raw data to determine regression ranges. It is also very noisy, which can
make it hard to interpret the data.

I think we need to spend some time:
A) Exposing more raw data for tools like orangefactor to consume (e.g.
resource usage)
B) Figuring out how to separate the signal from the noise
C) Making it super easy to drill through this data to identify a
regression range.
jmaher
2014-04-04 19:49:45 UTC
Permalink
Post by L. David Baron
Post by jmaher
4) In the case we go another 2 days with no response from a module owner, we will disable the test.
Are you talking about newly-added tests, or tests that have been
passing for a long time and recently started failing?
In the latter case, the burden should fall on the regressing patch,
and the regressing patch should get backed out instead of disabling
the test.
I had overlooked the case of a newly-added test; I agree that backing it out is the right thing.
Post by L. David Baron
If this plan is applied to existing tests, then it will lead to
style system mochitests being turned off due to other regressions
because I'm the person who wrote them and the module owner, and I
don't always have time to deal with regressions in other parts of
code (e.g., the JS engine) leading to these tests failing
intermittently.
If that happens, we won't have the test coverage we need to add new
CSS properties or values.
Interesting point. Are these tests failing often? Can we invest some minimal time into these to make them more reliable from a test case or test harness perspective?

As long as there is a dialog in the bugs filed, I would find it hard to believe we would just disable a test and have it come as a surprise.
Post by L. David Baron
More generally, it places a much heavier burden on contributors who
have been part of the project longer, who are also likely to be
overburdened in other ways (e.g., reviews). That's why the burden
needs to be placed on the regressing change rather than the original
author of the test.
I am open to ideas to help figure out the offending changes. My understanding is that many of the test failures are due to small adjustments to the system, or even to the order the tests are run in, which cause the test to fail intermittently.

I know there are a lot of new faces in the Mozilla community every month; could we offload some (not all) of this work to mozillians with less on their plate?
Post by L. David Baron
These 10% and 50% numbers don't feel right to me; I think the
thresholds should probably be substantially lower. But I think it's
easier to think about these numbers in failures/day, at least for
me.
Great feedback on this. Maybe we pick the top 10 from orange factor (http://brasstacks.mozilla.com/orangefactor/index.html), or we cut the numbers in half. 10% and 50% were sort of last-resort numbers I came up with; ideally there would have already been a conversation/bug about the problem.
Post by L. David Baron
Post by jmaher
2) When we are bringing a new platform online (Android 2.3, b2g, etc.) many tests will need to be disabled prior to getting the tests on tbpl.
That's reasonable as long as work is done to try to get the tests
enabled (at a minimum, actually enabling all the tests that are
passing reliably, rather than stopping after enabling the passing
tests in only some directories).
One thing I have heard is coming online is a way to track the number of tests available/disabled per platform; that would really help with ensuring we are not ignoring thousands of tests on a new platform.
L. David Baron
2014-04-04 23:48:24 UTC
Permalink
Post by jmaher
Post by L. David Baron
If this plan is applied to existing tests, then it will lead to
style system mochitests being turned off due to other regressions
because I'm the person who wrote them and the module owner, and I
don't always have time to deal with regressions in other parts of
code (e.g., the JS engine) leading to these tests failing
intermittently.
If that happens, we won't have the test coverage we need to add new
CSS properties or values.
Interesting point. Are these tests failing often? Can we invest some minimal time into these to make them more reliable from a test case or test harness perspective?
They're not failing often. But there has been at least one (and I
think more) occasion where changes in very unrelated code (one was
tracked down to a JS JIT change) caused them, and a bunch of other
tests, to start failing intermittently at higher frequency. I think
this sort of pattern, of a low-level change causing large
numbers of tests to fail at moderate frequency, should be detected
in a way that doesn't lead to us disabling large numbers of tests
(and then probably failing to re-enable them when the problem is
resolved).

I think we could use better tools for finding regression windows for
intermittent failures.

I'd also note that I've been trying to give frequent intermittent
oranges in layout a reasonably high priority, and encourage others
to do the same. I think we've been doing a reasonably good job at
this lately, although rates of progress have varied.

-David
--
𝄞 L. David Baron http://dbaron.org/ 𝄂
𝄢 Mozilla https://www.mozilla.org/ 𝄂
Before I built a wall I'd ask to know
What I was walling in or walling out,
And to whom I was like to give offense.
- Robert Frost, Mending Wall (1914)
Karl Tomlinson
2014-04-07 02:33:39 UTC
Permalink
Post by jmaher
2) When we are bringing a new platform online (Android 2.3, b2g, etc.) many
tests will need to be disabled prior to getting the tests on tbpl.
It makes sense to disable some tests so that others can run.

I assume bugs are filed for the disabled tests.

What is difficult here is that there is no clear path to ensure
that sufficient tests are running on the new platform before tests
are stopped on an older version of the platform.