Closed Bug 1154377 Opened 9 years ago Closed 9 years ago

Intermittent OS X build Automation Error: mozprocess timed out after 2400 seconds running ['/tools/buildbot/bin/python', 'mach', '--log-no-times', 'build', '-v'] timed out after 2400 seconds of no output

Categories

(Release Engineering :: General, defect)

x86_64
macOS
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jlund, Assigned: glandium)

Attachments

(3 files)

+++ This bug was initially created as a clone of Bug #1145507 +++
Could this have something to do with sccache? Lots of different Mac slaves in scl3 are all failing in intl/icu or js.
Flags: needinfo?(mh+mozilla)
Perhaps not that specific actually, but sccache is try-specific IIRC.
I don't know. Maybe. But without looking at a stuck slave while it's stuck, I can't tell.
Flags: needinfo?(mh+mozilla)
glandium and I dug into this. sccache is having issues, and it looks like the problem lies in the server and its network communication. So this disables sccache, r+ from Callek on IRC. Jobs starting after 1826 Pacific should be OK.

https://hg.mozilla.org/build/mozharness/rev/2dc80314c97e
https://hg.mozilla.org/build/mozharness/rev/2e1cd6e8c253
Attachment #8593141 - Flags: review+
Attachment #8593141 - Flags: checked-in+
Comment on attachment 8593141 [details] [diff] [review]
[mozharness] Disable sccache on mac

Backed out because I missed the mozharness pinning in-tree, which means it's not going to be effective.

https://hg.mozilla.org/build/mozharness/rev/1d37d6e92c7f
https://hg.mozilla.org/build/mozharness/rev/8466a94c95b2

I'll work out how to do this in buildbot instead.
Attachment #8593141 - Flags: checked-in+ → checked-in-
So, one part of the equation is that for some reason the main sccache server
process stops listening to its port. This part needs further investigation.
Now, this would be less of a problem if the port wasn't still bound and
listening, because the server subprocesses still have an open file descriptor
they inherited from the main sccache process, except they are not handling
incoming connections (they're not supposed to)...
(In reply to Mike Hommey [:glandium] from comment #119)
> So, one part of the equation is that for some reason the main sccache server
> process stops listening to its port. This part needs further investigation.

This part is actually a red herring.

What is happening is that one process gets stuck on an HTTP request to S3. Other processes continue to work until make has nothing more to do, at which point it waits for that stuck process. Then 5 minutes pass and the sccache server stops listening on its socket because no new request came in. From there on, any new sccache client that happens to start for some reason just gets stuck.
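
As a rough illustration of that kind of hang (this is not sccache's code; `request_with_deadline` and the "silent" listener are made up for this sketch): a peer that is bound and listening but never answers leaves a blocking client waiting indefinitely, whereas a client with a socket timeout gives up and can report the failure.

```python
import socket


def request_with_deadline(address, payload, timeout):
    """Send payload and wait for a reply, giving up after `timeout` seconds.

    Returns the reply bytes, or None if the peer did not answer in time.
    Hypothetical helper for illustration only.
    """
    try:
        # The timeout given here also applies to the later sendall()/recv()
        # calls on the returned socket.
        with socket.create_connection(address, timeout=timeout) as conn:
            conn.sendall(payload)
            return conn.recv(4096)
    except socket.timeout:
        return None


if __name__ == "__main__":
    # A listener that is bound but never calls accept(): the kernel still
    # completes the TCP handshake, so the client connects and then waits.
    silent = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    silent.bind(("127.0.0.1", 0))
    silent.listen(1)
    print(request_with_deadline(silent.getsockname(), b"ping", 0.5))
    silent.close()
```

Without the `timeout` argument the `recv()` call would block for as long as the listener stays open, which is roughly what a stuck client looks like from the outside.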

> Now, this would be less of a problem if the port wasn't still bound and
> listening, because the server subprocesses still have an open file descriptor
> they inherited from the main sccache process, except they are not handling
> incoming connections (they're not supposed to)...

This is still possibly true, but there are two issues here:
- Network is currently flaky.
- Sccache doesn't handle that very well.
Note that the "keeps listening in subprocesses" behavior is specific to OS X. It doesn't happen on Linux.
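
For reference, the general idea of not leaking a listening socket into forked workers can be sketched as below. This is a minimal sketch under assumptions, not the attached patch: `spawn_worker` and `work` are hypothetical names, and it does not try to reproduce or explain the OS X versus Linux difference.

```python
import os
import socket
import time


def spawn_worker(listener, work):
    """Fork a worker that drops the inherited listening socket before working.

    Hypothetical helper for illustration only.
    """
    pid = os.fork()
    if pid == 0:
        # Child: fork() copied the parent's descriptors, including the
        # listening socket. Close this copy so only the parent holds the port.
        status = 1
        try:
            listener.close()
            work()
            status = 0
        finally:
            # Leave immediately, skipping cleanup handlers inherited from
            # the parent.
            os._exit(status)
    return pid


if __name__ == "__main__":
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    address = listener.getsockname()

    pid = spawn_worker(listener, lambda: time.sleep(1.0))
    listener.close()  # the parent stops listening
    time.sleep(0.2)   # give the child time to close its copy

    try:
        socket.create_connection(address, timeout=1).close()
        print("port still accepts connections")
    except ConnectionRefusedError:
        print("port is closed everywhere")
    os.waitpid(pid, 0)
```

If the child kept its copy of the socket, the port would stay bound after the parent closed it, and new clients would connect to something that never accepts.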
Assignee: nobody → mh+mozilla
Attachment #8593323 - Flags: review?(mshal)
Attachment #8593323 - Flags: review?(mshal) → review+
Comment on attachment 8593324 [details] [diff] [review]
Close listening sockets when forking processes

Any theories as to why the network would be flaky just within the past week or so? (Bug 1153012 was reported on 4/9)
Attachment #8593324 - Flags: review?(mshal) → review+
No idea, someone with knowledge of the network infra should look into it.
Blocks: 1155476
(In reply to Nick Thomas [:nthomas] from comment #118)
> Buildbot fix:
>  https://hg.mozilla.org/build/buildbot-configs/rev/58939afabf7c
>  https://hg.mozilla.org/build/buildbot-configs/rev/c9362236ab6c
> and a typo fix, because it's one of those days
>  https://hg.mozilla.org/build/buildbot-configs/rev/b90ef9540c76
>  https://hg.mozilla.org/build/buildbot-configs/rev/faefb8767843
> 
> Reconfiged the try masters, done at 2100 Pacific.

This can be backed out after the landing of bug 1155476 is picked up by most try pushes.
No longer blocks: 1155476
Depends on: 1155476
<glandium> nthomas: 72 of the past 100 try pushes are using the fixed sccache, fwiw

Waiting a little longer.
Backed out with https://hg.mozilla.org/build/buildbot-configs/rev/23e73092ea0a. By the time this goes live in a reconfig we should be at a higher proportion, and the network would need to still be broken for it to manifest.
I think we can close this bug, now. Especially now that /different/ occurrences have been hit and attributed here. (comment 147)
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Heh, like we'd stop starring just because a bug was closed-and-not-the-right-one.
I'd hope you look at why something is closed before starring.
Component: General Automation → General