Closed Bug 1138234 Opened 9 years ago Closed 9 years ago

b2g_bumper stalling frequently

Categories

(Release Engineering :: General, defect)

x86
macOS
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: nthomas, Unassigned)

References

(Blocks 1 open bug)

Details

Attachments

(1 file)

eg:
[cltbld@buildbot-master66.bb.releng.usw2.mozilla.com ~]$ date
Sun Mar  1 13:09:37 PST 2015

[cltbld@buildbot-master66.bb.releng.usw2.mozilla.com ~]$ ps uxf
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
cltbld    2343  0.0  0.0 106096  1252 ?        Ss   00:20   0:00 /bin/bash /usr/local/bin/run_b2g_bumper.sh
cltbld    2360  0.0  0.0 106096   768 ?        S    00:20   0:00  \_ /bin/bash /usr/local/bin/run_b2g_bumper.sh
cltbld    5889  0.0  0.3 206096 12504 ?        S    00:24   0:00      \_ python /builds/b2g_bumper/mozharness/scripts/b2g_bumper.py --base-work-dir /builds/b2g_bumper/v2.2 -c /builds/b2g_bumper/mozharness/configs/b2g_bumper/v2.2.py --import-git-ref-cache --pus
cltbld    5890  0.0  0.2  83296  8236 ?        S    00:24   0:00          \_ python /usr/local/bin/hgtool.py --bundle https://ftp-ssl.mozilla.org/pub/mozilla.org/firefox/bundles/.hg https://hg.mozilla.org/releases/mozilla-b2g37_v2_2/ /builds/b2g_bumper/v2.2/bu
cltbld    5893  0.0  0.5 102536 22128 ?        S    00:24   0:00              \_ /tools/python27/bin/python2.7 /usr/local/bin/hg pull https://hg.mozilla.org/releases/mozilla-b2g37_v2_2/

This has happened at least 3 times in the last week (e.g. bug 1136148), and b2g_bumper seems to be less well behaved since bm66 was rebuilt. The common factor is a hung hg pull, waiting to receive data from hg.m.o.

IIRC legacy vcs-sync has magic to detect this. Bug 1088590 will add a 30 minute timeout, but I'm not sure what b2g_bumper will do with the exception afterwards.
mgerva: I believe you were working on hg timeouts in mozharness - is this something where we could use your nice work?
Flags: needinfo?(mgervasini)
(In reply to Nick Thomas [:nthomas] from comment #0)
> Bug 1088590 will add a 30
> minute timeout, but I'm not sure what b2g_bumper will do with the exception
> afterwards.

As long as the operation times out, whatever happens in b2g bumper, it should recover on a subsequent run. So even if the mozharness process fails out, we should be fine, as the next run should be OK.
(In reply to Pete Moore [:pete][:pmoore] from comment #1)
> mgerva: I believe you were working on hg timeouts in mozharness - is this
> something where we could use your nice work?

Unfortunately, the bug I am working on, 1088590, adds a timeout for hg operations in tools, not in mozharness.
Flags: needinfo?(mgervasini)
I guess we should have a maximum timeout for all hg operations in mozharness, so that all mozharness scripts are protected from hanging hg processes. We could optionally have a lower timeout possible for a job that wants it, but e.g. a maximum timeout of an hour would seem reasonable to me for any hg operation. This should help stability of all cron'd jobs such as vcs sync, b2g bumper, that are ok to fail intermittently as on subsequent runs they can recover.

I saw in bug 1109346 that we now have tc-vcs in mozharness - might also be an option.
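
To illustrate the "global maximum, optionally lower per job" idea above, here is a minimal sketch in plain Python 3. It is not mozharness code: the names MAX_HG_TIMEOUT, effective_timeout and run_capped are hypothetical, and the one-hour cap is only the value floated in this comment.

```python
import subprocess

# Hypothetical global cap: one hour, the value suggested in the comment above.
MAX_HG_TIMEOUT = 60 * 60


def effective_timeout(requested=None, maximum=MAX_HG_TIMEOUT):
    """Return the timeout to apply: the job's own value if lower, else the cap."""
    if requested is None or requested <= 0:
        return maximum
    return min(requested, maximum)


def run_capped(cmd, requested_timeout=None):
    """Run cmd, killing it once the effective timeout has elapsed.

    Returns the exit code, or None if the command timed out.
    """
    timeout = effective_timeout(requested_timeout)
    try:
        # subprocess.run kills the child before raising TimeoutExpired.
        return subprocess.run(cmd, timeout=timeout).returncode
    except subprocess.TimeoutExpired:
        return None
```

A cron'd job would call something like run_capped(["hg", "pull", url], requested_timeout=600) and treat None as an intermittent failure that the next run can recover from.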
See also bug 1113944 for the likely root cause. Not making it a blocker as it appears stalled out.
See Also: → vcshangs
(In reply to Massimo Gervasini [:mgerva] from comment #3)
> Unfortunately, the bug I am working, 1088590, adds a timeout for hg
> operations in tools, not in mozharness.

I think we actually end up using hgtool.py when running b2g_bumper, see the process list above. Not the copy in the tools repo, but one at bm66:/usr/local/bin/hgtool.py which comes from puppet (puppet/file/default/modules/packages/files/hgtool.py via puppet/modules/packages/manifests/mozilla/hgtool.pp). So that's a static copy, where tools/buildfarm/utils/package-script.py has been used to include the deps. We could update it after bug 1088590 lands.

Then we'll have a 30 min timeout on pull, with a retry via http://hg.mozilla.org/build/mozharness/file/default/mozharness/base/vcs/vcsbase.py#l83.
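
The combination described above (a timeout on the pull, plus a retry around it) can be sketched roughly as follows. This is an illustration in plain Python 3, not the actual hgtool.py or vcsbase.py logic; pull_with_retry and its parameters are hypothetical names, and the 30 minute default simply mirrors the figure mentioned in this bug.

```python
import subprocess
import time


def pull_with_retry(cmd, timeout=30 * 60, attempts=3, sleep_between=0):
    """Run cmd up to `attempts` times, each bounded by `timeout` seconds.

    A timeout or a non-zero exit code counts as a failed attempt.
    Returns True as soon as one attempt succeeds, False otherwise.
    """
    for attempt in range(1, attempts + 1):
        try:
            if subprocess.run(cmd, timeout=timeout).returncode == 0:
                return True
        except subprocess.TimeoutExpired:
            # The hung child has already been killed by subprocess.run.
            pass
        if attempt < attempts:
            time.sleep(sleep_between)
    return False
```

With a wrapper like this, a hung pull costs at most attempts * timeout seconds before the job fails out and leaves recovery to the next scheduled run.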
Stalled again at 19:56 on b2g-inbound, gave the hg pull a HUP.
Two more today, mozilla-b2g30_v1_4 & b2g-inbound.
Bug 1088590 is complete. Does someone want to have a go at updating the copy used by b2g_bumper? See comment #7 for more info. A timeout shorter than 30 minutes would be a bonus.
Depends on: 1088590
Blocks: vcshangs
Depends on: 1143610
Hey Nick,

I see mgerva has landed the updated hg tool in puppet in dep bug 1143610 - can you confirm if this has resolved things in b2g bumper land, or if you are still seeing issues there?

If all good I guess we can close this bug. =)

Thanks!
Pete
Flags: needinfo?(nthomas)
Attached file nagios bumper alerts
[The dates in 'moznet_#buildduty_20150319.log' are NZDT, followed by timestamps in PDT, to keep your brain nimble]

Nagios is still alerting since the change landed: three stalls on the 17th (Pacific), two each on the 18th and 22nd, and one each on the 23rd and 25th. On the 17th the lag was up to 4062 seconds.

I haven't personally touched the bumper box in any of those cases, so I'll pass the ni onto jlund to see if he has.

We may want to let this go a little longer to get a bit more data.
Flags: needinfo?(nthomas) → needinfo?(jlund)
I *think* it is better. I have not personally had to intervene with bumper in the last week. I'll bring this up at the buildduty standup tomorrow, but I agree it won't hurt to check back on this in a week's time.
Flags: needinfo?(jlund)
(In reply to Jordan Lund (:jlund) from comment #14)
> I *think* it is better. I have not personally had to intervene with bumper
> in the last week. I'll bring this up at the buildduty standup tomorrow but I
> agree we it won't hurt to check back on this in a week's time.

We'll re-open this if necessary.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Component: Tools → General