Closed
Bug 1138234
Opened 9 years ago
Closed 9 years ago
b2g_bumper stalling frequently
Categories
(Release Engineering :: General, defect)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: nthomas, Unassigned)
References
(Blocks 1 open bug)
Details
Attachments
(1 file)
6.59 KB,
text/plain
eg:

[cltbld@buildbot-master66.bb.releng.usw2.mozilla.com ~]$ date
Sun Mar 1 13:09:37 PST 2015
[cltbld@buildbot-master66.bb.releng.usw2.mozilla.com ~]$ ps uxf
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
cltbld 2343 0.0 0.0 106096 1252 ? Ss 00:20 0:00 /bin/bash /usr/local/bin/run_b2g_bumper.sh
cltbld 2360 0.0 0.0 106096 768 ? S 00:20 0:00 \_ /bin/bash /usr/local/bin/run_b2g_bumper.sh
cltbld 5889 0.0 0.3 206096 12504 ? S 00:24 0:00 \_ python /builds/b2g_bumper/mozharness/scripts/b2g_bumper.py --base-work-dir /builds/b2g_bumper/v2.2 -c /builds/b2g_bumper/mozharness/configs/b2g_bumper/v2.2.py --import-git-ref-cache --pus
cltbld 5890 0.0 0.2 83296 8236 ? S 00:24 0:00 \_ python /usr/local/bin/hgtool.py --bundle https://ftp-ssl.mozilla.org/pub/mozilla.org/firefox/bundles/.hg https://hg.mozilla.org/releases/mozilla-b2g37_v2_2/ /builds/b2g_bumper/v2.2/bu
cltbld 5893 0.0 0.5 102536 22128 ? S 00:24 0:00 \_ /tools/python27/bin/python2.7 /usr/local/bin/hg pull https://hg.mozilla.org/releases/mozilla-b2g37_v2_2/

This has happened at least 3 times in the last week, eg bug 1136148, and it seems to be less well behaved since bm66 was rebuilt. The commonality is a hung hg pull, waiting to receive data from hg.m.o. IIRC legacy vcs-sync has magic to detect this. Bug 1088590 will add a 30 minute timeout, but I'm not sure what b2g_bumper will do with the exception afterwards.
Comment 1•9 years ago
mgerva: I believe you were working on hg timeouts in mozharness - is this something where we could use your nice work?
Flags: needinfo?(mgervasini)
Comment 2•9 years ago
(In reply to Nick Thomas [:nthomas] from comment #0)
> Bug 1088590 will add a 30 minute timeout, but I'm not sure what b2g_bumper will do with the exception afterwards.

As long as the operation times out, whatever happens in b2g_bumper, it should recover on a subsequent run. So if the mozharness process fails out, we should be fine, as the next run should be ok.
Comment 3•9 years ago
(In reply to Pete Moore [:pete][:pmoore] from comment #1)
> mgerva: I believe you were working on hg timeouts in mozharness - is this something where we could use your nice work?

Unfortunately, the bug I am working on, 1088590, adds a timeout for hg operations in tools, not in mozharness.
Flags: needinfo?(mgervasini)
Comment 4•9 years ago
drat
Comment 5•9 years ago
I guess we should have a maximum timeout for all hg operations in mozharness, so that all mozharness scripts are protected from hanging hg processes. We could optionally have a lower timeout possible for a job that wants it, but e.g. a maximum timeout of an hour would seem reasonable to me for any hg operation. This should help stability of all cron'd jobs such as vcs sync, b2g bumper, that are ok to fail intermittently as on subsequent runs they can recover. I saw in bug 1109346 that we now have tc-vcs in mozharness - might also be an option.
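To make the proposal above concrete, here is a minimal sketch of a global ceiling with an optional lower per-job timeout. It is not mozharness code: `MAX_HG_TIMEOUT`, `effective_timeout` and `run_with_timeout` are hypothetical names, the one-hour ceiling is just the figure suggested above, and it assumes Python 3's `subprocess.run`, not the Python 2.7 that the bumper box runs.

```python
import subprocess
import sys

# Hypothetical ceiling for any single hg operation: one hour, in seconds.
MAX_HG_TIMEOUT = 60 * 60


def effective_timeout(requested=None, ceiling=MAX_HG_TIMEOUT):
    """Return the timeout to use: a job may ask for less, never for more."""
    if requested is None:
        return ceiling
    return min(requested, ceiling)


def run_with_timeout(cmd, requested=None, ceiling=MAX_HG_TIMEOUT):
    """Run cmd and return its exit code, or None if it was killed on timeout.

    subprocess.run kills the direct child when the timeout expires;
    grandchildren are not handled in this sketch.
    """
    try:
        result = subprocess.run(cmd, timeout=effective_timeout(requested, ceiling))
    except subprocess.TimeoutExpired:
        return None
    return result.returncode


if __name__ == "__main__":
    # Stand-in for a hung pull: a child that sleeps longer than its timeout.
    hung = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(run_with_timeout(hung, requested=1))
```

A cron'd job could treat a `None` result as a failed run and exit, relying on the next scheduled run to recover.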
Comment 6•9 years ago
see also bug 1113944 for the likely root cause. Not making a blocker as it appears stalled out.
See Also: → vcshangs
Reporter | ||
Comment 7•9 years ago
(In reply to Massimo Gervasini [:mgerva] from comment #3)
> Unfortunately, the bug I am working, 1088590, adds a timeout for hg operations in tools, not in mozharness.

I think we actually end up using hgtool.py when running b2g_bumper, see the process list above. Not the copy in the tools repo, but one at bm66:/usr/local/bin/hgtool.py which comes from puppet (puppet/file/default/modules/packages/files/hgtool.py via puppet/modules/packages/manifests/mozilla/hgtool.pp). So that's a static copy, where tools/buildfarm/utils/package-script.py has been used to include the deps.

We could update it after bug 1088590 lands. Then we'll have a 30 min timeout on pull, with a retry via http://hg.mozilla.org/build/mozharness/file/default/mozharness/base/vcs/vcsbase.py#l83.
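As a rough illustration of the timeout-plus-retry behaviour described above, the sketch below wraps a command in a per-attempt timeout and retries it a fixed number of times. This is not the code in vcsbase.py or hgtool.py: `pull_with_retry`, the attempt count and the sleep interval are assumptions, and only the 30 minute default comes from this bug. It uses Python 3's `subprocess.run`.

```python
import subprocess
import sys
import time


def pull_with_retry(cmd, timeout=30 * 60, attempts=3, sleep_between=0):
    """Run cmd with a per-attempt timeout and retry on failure or timeout.

    Returns the 1-based number of the attempt that succeeded and raises
    RuntimeError once every attempt has failed or timed out.
    """
    for attempt in range(1, attempts + 1):
        try:
            if subprocess.run(cmd, timeout=timeout).returncode == 0:
                return attempt
        except subprocess.TimeoutExpired:
            # The hung child has been killed; fall through and try again.
            pass
        if attempt < attempts:
            time.sleep(sleep_between)
    raise RuntimeError("command failed after %d attempts" % attempts)


if __name__ == "__main__":
    # Stand-in for a successful pull: a child process that exits 0.
    print(pull_with_retry([sys.executable, "-c", "pass"]))
```

If the final attempt also fails, the exception ends the run, which matches the expectation in comment 2 that a later cron run picks things up again.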
Reporter | ||
Comment 8•9 years ago
Stalled again at 19:56 on b2g-inbound, gave the hg pull a HUP.
Reporter | ||
Comment 9•9 years ago
Two more today, mozilla-b2g30_v1_4 & b2g-inbound.
Reporter | ||
Comment 10•9 years ago
Bug 1088590 is complete. Does someone want to have a go at updating the copy used by b2g_bumper? See comment #7 for more info. A timeout shorter than 30 minutes would be a bonus.
Depends on: 1088590
Comment 12•9 years ago
Hey Nick, I see mgerva has landed the updated hg tool in puppet in dep bug 1143610 - can you confirm if this has resolved things in b2g bumper land, or if you are still seeing issues there? If all good I guess we can close this bug. =) Thanks! Pete
Flags: needinfo?(nthomas)
Reporter | ||
Comment 13•9 years ago
[The dates in 'moznet_#buildduty_20150319.log' are NZDT, followed by timestamps in PDT, to keep your brain nimble]

Nagios is still alerting since the change landed - three stalls on the 17th (Pacific), two each on the 18th and 22nd, one each on the 23rd and 25th. On the 17th the lag was up to 4062 seconds. I haven't personally touched the bumper box in any of those cases, so I'll pass the ni on to jlund to see if he has. We may want to let this go a little longer to get a bit more data.
Flags: needinfo?(nthomas) → needinfo?(jlund)
Comment 14•9 years ago
I *think* it is better. I have not personally had to intervene with bumper in the last week. I'll bring this up at the buildduty standup tomorrow, but I agree it won't hurt to check back on this in a week's time.
Flags: needinfo?(jlund)
Comment 15•9 years ago
(In reply to Jordan Lund (:jlund) from comment #14)
> I *think* it is better. I have not personally had to intervene with bumper in the last week. I'll bring this up at the buildduty standup tomorrow but I agree it won't hurt to check back on this in a week's time.

We'll re-open this if necessary.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Assignee | ||
Updated•7 years ago
|
Component: Tools → General