Closed Bug 1137322 Opened 9 years ago Closed 9 years ago

osx test slaves are failing to download a test zip from similar rev

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

x86
macOS
task
Not set
normal

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: jlund, Unassigned)

Details

There was a spike in failed download attempts from ftp. Sheriffs reported 5 or 6 instances close together, around 09:30 PT.

log example: https://treeherder.mozilla.org/logviewer.html#?job_id=7024399&repo=mozilla-inbound

snippet: 09:24:14 INFO - Can't download from https://ftp-ssl.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-inbound-macosx64/1424962152/firefox-39.0a1.en-US.mac.tests.zip to /builds/slave/talos-slave/test/build/firefox-39.0a1.en-US.mac.tests.zip! 

host example of a slave trying to download from ftp: t-snow-r4-0061

builder: mozilla-inbound_snowleopard_test-mochitest-2
Sheriffs have reported that the failures are coming from hosts running jobs for the same revision.

Usul has reported no anomalies within ftp.

Since every failure so far is on osx and for the same rev, I am starting to suspect that we uploaded a corrupted test zip for that revision (https://ftp-ssl.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-inbound-macosx64/1424962152/firefox-39.0a1.en-US.mac.tests.zip), and that this is why the test jobs trying to download it are failing.

sheriffs are pushing a new rev. I'll leave this bug open while we wait on results.
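
For reference, the corrupted-zip hypothesis above could be checked against a downloaded copy with something like the following. This is only a minimal sketch using Python's standard `zipfile` module; the helper name `zip_is_intact` is hypothetical and is not part of mozharness or any slave tooling.

```python
import zipfile
import zlib


def zip_is_intact(path):
    """Return True if `path` is a readable zip whose members all pass their CRC checks.

    Hypothetical helper for illustration only.
    """
    # A truncated or partially overwritten archive usually lacks a valid
    # end-of-central-directory record, which is_zipfile() looks for.
    if not zipfile.is_zipfile(path):
        return False
    try:
        with zipfile.ZipFile(path) as zf:
            # testzip() reads every member and returns the name of the first
            # bad one, or None if all members are fine.
            return zf.testzip() is None
    except (zipfile.BadZipFile, OSError, EOFError, zlib.error):
        return False
```

A check like this would only tell us that a given copy is damaged, not why. A zip that is being overwritten during the download would fail it just as a genuinely corrupt upload would.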
Summary: downloading from ftp is sporadically failing → osx test slaves are failing to download a test zip from similar rev
The opt Mac build was double-retriggered, finishing 3 minutes apart. That means that while test jobs triggered by the first retrigger were downloading, the second retrigger was uploading over the top. That doesn't go well for us, and never will as long as retriggers upload over the top of the previous job.
(In reply to Phil Ringnalda (:philor) from comment #2)
> The opt Mac build was double-retriggered, finishing 3 minutes apart. That
> means that while test jobs triggered by the first retrigger were
> downloading, the second retrigger was uploading over the top. That doesn't
> go well for us, and never will as long as retriggers upload over the top of
> the previous job.

Is this happening because we're pulling from latest-* rather than from the revision-specific dir? Do we have a bug on file for that already?
No, it's happening because retriggers intentionally replace the job they redo, in place. We had the same conversation over and over when we first saw this, but I can't remember catlee's part of it, where he explains why that's a good thing. It means that retriggering a job makes all evidence of the original job disappear completely, it causes this failure if you retrigger twice, and it makes the results of retriggering a build because you didn't like its test results completely nondeterministic, because you absolutely cannot tell whether the new test jobs downloaded the new build or the old one.
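
To illustrate the "uploading over the top" problem in general terms: a common way to keep readers from ever seeing a half-written file is to write the new copy under a temporary name and then rename it into place. The sketch below assumes a local POSIX-style filesystem and uses a hypothetical `publish_atomically` helper; it is not how the build upload to ftp actually works, and it does nothing about the nondeterminism of which build a test job ends up with.

```python
import os
import tempfile


def publish_atomically(data: bytes, dest_path: str) -> None:
    """Write `data` to `dest_path` so readers see the old or the new file, never a mix.

    Hypothetical helper for illustration only.
    """
    dest_dir = os.path.dirname(os.path.abspath(dest_path))
    # The temporary file must live in the same directory (same filesystem)
    # for the final rename to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir, suffix=".part")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())
        # os.replace() swaps the new file in as a single step.
        os.replace(tmp_path, dest_path)
    except BaseException:
        # Clean up the partial temporary file if anything went wrong.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

With this pattern, a download that started before the second upload finished would keep reading the first file in full, rather than a file that changes underneath it.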
Depends on: 1138512
Sounds like this could be improved. However, since it doesn't seem to happen regularly (correct me if I'm wrong), I think it will sit at a lower priority than what we currently have on our plate.

filed 1138512 to track the effort but I won't leave it open without an assignee.
No longer depends on: 1138512
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → WONTFIX
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard