Closed Bug 733663 Opened 12 years ago Closed 7 years ago

Consider limiting how often or for how long we RETRY from hg_errors

Categories: Release Engineering :: General, defect, P3
Platform: x86 Linux

Tracking: (Not tracked)

Status: RESOLVED WONTFIX

People: (Reporter: dholbert, Assigned: catlee)

Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2670] [retry][hg]

Attachments: (1 file)

A Try push of mine seems to have hit an infra problem of some sort (bug 733658).

That infra problem seems to be of the sort that triggers auto-rebuilds (which are then doomed to failure, and that failure triggers another auto-rebuild, etc).

As a result, I got 13 rapid-fire emails from Try in the span of ~6 minutes, all for the same platform.

It looks like they've stopped now (maybe because the infra issue cleared itself up, or I got a better buildslave?), but if it hadn't cleared up, I suspect I would have been continuously spammed indefinitely.

Each email I received looked exactly like this:
=========
Your Try Server build (2a9660ea4911) had unknown problem (5) on builder try-win32-debug.

The full log for this build run is available at http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/dholbert@mozilla.com-2a9660ea4911/try-win32-debug/try-win32-debug-build2603.txt.gz.
=========
...with the only difference being the log number. (03.txt.gz up through 15.txt.gz)

Here are the logs for these failures (so far):
http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/dholbert@mozilla.com-2a9660ea4911/try-win32-debug/try-win32-debug-build2603.txt.gz
http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/dholbert@mozilla.com-2a9660ea4911/try-win32-debug/try-win32-debug-build2604.txt.gz
http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/dholbert@mozilla.com-2a9660ea4911/try-win32-debug/try-win32-debug-build2605.txt.gz
http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/dholbert@mozilla.com-2a9660ea4911/try-win32-debug/try-win32-debug-build2606.txt.gz
http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/dholbert@mozilla.com-2a9660ea4911/try-win32-debug/try-win32-debug-build2607.txt.gz
http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/dholbert@mozilla.com-2a9660ea4911/try-win32-debug/try-win32-debug-build2608.txt.gz
http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/dholbert@mozilla.com-2a9660ea4911/try-win32-debug/try-win32-debug-build2609.txt.gz
http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/dholbert@mozilla.com-2a9660ea4911/try-win32-debug/try-win32-debug-build2610.txt.gz
http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/dholbert@mozilla.com-2a9660ea4911/try-win32-debug/try-win32-debug-build2611.txt.gz
http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/dholbert@mozilla.com-2a9660ea4911/try-win32-debug/try-win32-debug-build2612.txt.gz
http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/dholbert@mozilla.com-2a9660ea4911/try-win32-debug/try-win32-debug-build2613.txt.gz
http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/dholbert@mozilla.com-2a9660ea4911/try-win32-debug/try-win32-debug-build2614.txt.gz
http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/dholbert@mozilla.com-2a9660ea4911/try-win32-debug/try-win32-debug-build2615.txt.gz

I'm filing this bug on making Try auto-detect this sort of issue, after a few failed autoretried builds, and stop at that point.  (Otherwise, it apparently can get into a state where it'll just endlessly spam the submitter.)

Maybe that Try feature already exists and the threshold is 13 failed retries, but I suspect not. :)
(In reply to Daniel Holbert [:dholbert] from comment #0)
> I'm filing this bug on making Try auto-detect this sort of issue, after a
> few failed autoretried builds, and stop at that point.  (Otherwise, it
> apparently can get into a state where it'll just endlessly spam the
> submitter.)

(and endlessly occupy build resources, which is also bad, of course.  I just highlighted the emails because they're more annoying -- I was starting to despair after hearing 13 rapid-fire "Ba-da-ding" email-notifications on my phone over 6 minutes and not knowing if they were ever going to stop. :))
(see bug 733658 comment 3 -- looks like the issue here was an hg outage / blip of some sort.   So, that's the sort of infra issue that can trigger this sort of perma-cycle/spam (for the duration of the hg outage))
Summary: Try can get into an infinite loop of builds (spamming endless failure emails to the developer) if it hits an infra perma-fail which triggers an autoretry → Try can get into an infinite loop of builds & spamming endless failure emails to the developer if it hits an infra perma-fail which triggers an autoretry
FWIW, I think the right fix here is bug 712205. Slaves shouldn't even make it back into the slave pool if they can't pull/update tools *and* we'd save time on jobs.
Depends on: 733801
Split the merciless spamming part off to bug 733801.

I don't think bug 712205 will save us, because updating the source repo automatically retries too. hg_errors is playing with fire, we know it is, we'll get burned (again, we already did once where I had to catch several infinite jobs before they could retry again, and cancel them), but because we have a thousand five minute outages of hg.m.o for each three hour outage, we want to keep on playing with fire, and passing the knowledge that this RETRY is RETRY12 on to the next job would be awkward, so although that's what's left for this bug to be about, I'll be surprised if that gets done.
(In reply to Phil Ringnalda (:philor) from comment #4)
> I don't think bug 712205 will save us, because updating the source repo
> automatically retries too. hg_errors is playing with fire, we know it is,
> we'll get burned (again, we already did once where I had to catch several
> infinite jobs before they could retry again, and cancel them), but because
> we have a thousand five minute outages of hg.m.o for each three hour outage,
> we want to keep on playing with fire, and passing the knowledge that this
> RETRY is RETRY12 on to the next job would be awkward, so although that's
> what's left for this bug to be about, I'll be surprised if that gets done.

Yes, bug 712205 is only a partial fix.

hg_errors currently looks like this:

hg_errors = ((re.compile("abort: HTTP Error 5\d{2}"), RETRY),
             (re.compile("abort: .*: no match found!"), RETRY),
             (re.compile("abort: Connection reset by peer"), RETRY),
             (re.compile("transaction abort!"), RETRY),
             (re.compile("abort: error:"), RETRY),)

What subset of those (if any) should we simply mark as FAILURE?
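
For concreteness, a minimal sketch of one possible split, assuming the usual RETRY/FAILURE result constants from buildbot.status.builder; which patterns actually belong in which bucket is exactly the question above:

import re

from buildbot.status.builder import FAILURE, RETRY

# Illustrative split only: treat transient server/network hiccups as RETRY
# and the likely-permanent errors as FAILURE. The patterns themselves are
# unchanged from the current table above.
hg_errors = ((re.compile("abort: HTTP Error 5\d{2}"), RETRY),        # server blip
             (re.compile("abort: Connection reset by peer"), RETRY), # network blip
             (re.compile("transaction abort!"), RETRY),              # interrupted pull
             (re.compile("abort: .*: no match found!"), FAILURE),    # bad rev/bookmark
             (re.compile("abort: error:"), FAILURE),)                # generic hg abort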
Slotting into Automation because these hg errors can affect any branch, not just try.

(In reply to Chris Cooper [:coop] from comment #5)
> What subset of those (if any) should we simply mark as FAILURE?

Do we need a new status such as UNRECOVERABLE?
Component: Release Engineering → Release Engineering: Automation
QA Contact: release → catlee
Whiteboard: [retry]
Ugh, no, that cure is vastly worse than the disease. I count 118 jobs on mozilla-inbound alone that I didn't have to manually retrigger after last night's brief hg.m.o outage because we automatically retried them.

Resummarizing since as filed it was bug 733801 plus invalid - "infinite loop" is only true if hg.m.o is down infinitely long, in which case none of us will be worrying about what happens because we'll be gone.

If you picture a happy world with both bug 712205 and bug 733801 fixed, then we are only talking about build slaves, and the only impacts of having them continue retrying when IT has said that hg.m.o will be down for three hours are that we'll use electricity running build slaves that don't have anything else they can be doing, and tbpl will get messy with big long strings of blue. Nice to fix, but the unfixed case just means that tree-watching people with nothing else to do since no builds are starting need to go around killing builds to stop them trying again (and they can, because tbpl lists the branches where things are running, unlike the "have to retrigger jobs because they failed when hg.m.o was down for 45 seconds" case, where we have absolutely no way of knowing what trees had failures because of not retrying), or releng needs to take the opportunity to shut down the masters.
Summary: Try can get into an infinite loop of builds & spamming endless failure emails to the developer if it hits an infra perma-fail which triggers an autoretry → Consider limiting how often or for how long we RETRY from hg_errors
Severity: normal → major
Priority: -- → P2
from triage:

this isn't a cause of problems, but it makes a bad situation worse

it's not obvious how to fix this. some hacky ideas:

- have something in the build artificially delay after failing to clone from hg

- rip out/modify buildbot's retry logic and make it support a limited # of retries, with delays in between (a rough sketch of these ideas follows below)
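
A rough sketch combining both ideas as they might look on the build side (a bounded number of clone attempts, with delays between them); the retry cap, the delays, and the standalone-script shape are all invented for illustration:

import subprocess
import sys
import time

# Illustrative only: bounded retries with growing delays around an hg clone.
# Nothing like this exists in the tree today; the numbers are placeholders.
MAX_RETRIES = 5
DELAYS = [30, 60, 120, 300, 600]  # seconds to sleep between attempts

def clone_with_bounded_retries(repo_url, dest):
    for attempt in range(MAX_RETRIES):
        if subprocess.call(["hg", "clone", repo_url, dest]) == 0:
            return 0
        delay = DELAYS[min(attempt, len(DELAYS) - 1)]
        print("hg clone failed (attempt %d/%d); sleeping %ds before retrying"
              % (attempt + 1, MAX_RETRIES, delay))
        time.sleep(delay)
    # Out of attempts: surface a hard failure instead of another RETRY.
    return 1

if __name__ == "__main__":
    sys.exit(clone_with_bounded_retries(sys.argv[1], sys.argv[2]))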
Severity: major → normal
Priority: P2 → P3
Whiteboard: [retry] → [retry][hg]
One quick fix could be to add a delay of X seconds between a build finishing with RETRY and the next build being kicked off.
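
Roughly what that could look like from the master side using Twisted's task.deferLater; schedule_retry and requeue_build_request are invented names standing in for whatever the master actually does when a build ends with RETRY, not real buildbot hooks:

from twisted.internet import reactor, task

# Illustrative only: delay re-queuing a RETRYed build instead of doing it
# immediately. The hook and the re-queue callable are placeholders.
RETRY_DELAY_SECONDS = 300  # e.g. wait five minutes before the next attempt

def schedule_retry(requeue_build_request, build_request):
    # deferLater fires the callable after the delay and returns a Deferred.
    return task.deferLater(reactor, RETRY_DELAY_SECONDS,
                           requeue_build_request, build_request)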
Product: mozilla.org → Release Engineering
Whiteboard: [retry][hg] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2662] [retry][hg]
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2662] [retry][hg] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2670] [retry][hg]
Assignee: nobody → catlee
I couldn't think of a nicer way to do this...maybe you have some other ideas?
Attachment #8529931 - Flags: feedback?(dustin)
Comment on attachment 8529931 [details] [diff] [review]
limit # of retries

That's a lot of synchronous DB queries, and more when builds are collapsed. It seems like that would lead to further performance degradation just when masters fall behind.
Hmm...can you think of a better way to do this?
Just running the queries asynchronously, possibly under a DeferredLock so that only one runs at a time, might help.  Then when the master is busy the effect will be to delay RETRYs further, and limit the number of DB queries to one at a time.
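
For reference, a minimal sketch of that pattern with Twisted's adbapi and DeferredLock; the connection details, SQL, and table/column names are placeholders rather than what the attached patch actually queries:

from twisted.enterprise import adbapi
from twisted.internet.defer import DeferredLock

# Illustrative only: run the retry-count query off the main thread via
# adbapi, serialized behind a DeferredLock so at most one is in flight.
dbpool = adbapi.ConnectionPool("sqlite3", "state.sqlite",
                               check_same_thread=False)
query_lock = DeferredLock()

def count_recent_retries(buildername):
    # DeferredLock.run() acquires the lock, calls runQuery, and releases the
    # lock once the returned Deferred fires.
    return query_lock.run(
        dbpool.runQuery,
        "SELECT COUNT(*) FROM builds WHERE buildername = ? AND results = 5",
        (buildername,))  # results = 5 is RETRY in buildbot's numbering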
Attachment #8529931 - Flags: feedback?(dustin) → feedback+
This is fixed in Taskcluster. We won't do anything further here for buildbot.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → WONTFIX
Component: General Automation → General