Closed Bug 1050808 Opened 10 years ago Closed 9 years ago

l10n repacks should retry on failure

Categories

(Release Engineering :: General, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: pmoore, Unassigned)

Details

(Whiteboard: [good first bug][lang=python][releaseduty])

Attachments

(1 file)

So today there were some l10n repack failures, which were caused by an intermittent issue.

I looked at the logs, and couldn't see any sign of retry attempts. Here is an extract of the log from http://buildbot-master85.srv.releng.scl3.mozilla.com:8001/builders/release-mozilla-beta-linux_repack_4%2F10/builds/7/steps/run_script/logs/stdio


.... here we are at the tail end of "fa" locale ....



Uploading /builds/slave/rel-m-beta-lx_rpk_4-0000000000/mozilla-beta/obj-l10n/dist/host/bin/mar
Uploading /builds/slave/rel-m-beta-lx_rpk_4-0000000000/mozilla-beta/obj-l10n/dist/host/bin/mbsdiff
Uploading /builds/slave/rel-m-beta-lx_rpk_4-0000000000/mozilla-beta/obj-l10n/dist/linux-i686/fa/firefox-32.0b5.checksums
Uploading /builds/slave/rel-m-beta-lx_rpk_4-0000000000/mozilla-beta/obj-l10n/dist/linux-i686/fa/firefox-32.0b5.checksums.asc
Running post-upload command: post_upload.py -p firefox -n 1 -v 32.0b5 --release-to-candidates-dir --signed
Upload complete
command: END (8.76s elapsed)

command: START
command: make --no-print-directory echo-variable-CHECKSUM_FILE AB_CD=fa
command: cwd: /builds/slave/rel-m-beta-lx_rpk_4-0000000000/mozilla-beta/obj-l10n/browser/locales
command: env: {'MOZ_MAKE_COMPLETE_MAR': '1', 'AB_CD': 'fa', 'MOZ_SIGN_CMD': 'python /builds/slave/rel-m-beta-lx_rpk_4-0000000000/scripts/release/signing/signtool.py --cachedir /builds/slave/rel-m-beta-lx_rpk_4-0000000000/signing_cache -t /builds/slave/rel-m-beta-lx_rpk_4-0000000000/token -n /builds/slave/rel-m-beta-lx_rpk_4-0000000000/nonce -c /builds/slave/rel-m-beta-lx_rpk_4-0000000000/scripts/release/signing/host.cert -H signing4.srv.releng.scl3.mozilla.com:9120 -H signing5.srv.releng.scl3.mozilla.com:9120 -H signing6.srv.releng.scl3.mozilla.com:9120', 'UPLOAD_SSH_KEY': '~/.ssh/ffxbld_dsa', 'LOCALE_MERGEDIR': '/builds/slave/rel-m-beta-lx_rpk_4-0000000000/mozilla-beta/obj-l10n/browser/locales/merged', 'COMM_REV': 'FIREFOX_32_0b5_RELEASE', 'MOZ_OBJDIR': 'obj-l10n', 'MAR': '/builds/slave/rel-m-beta-lx_rpk_4-0000000000/mozilla-beta/obj-l10n/dist/host/bin/mar', 'UPLOAD_TO_TEMP': '1', 'LD_LIBRARY_PATH': '/tools/gcc-4.3.3/installed/lib', 'ZIP_IN': '/builds/slave/rel-m-beta-lx_rpk_4-0000000000/firefox.tar.bz2', 'UPLOAD_USER': 'ffxbld', 'MBSDIFF': '/builds/slave/rel-m-beta-lx_rpk_4-0000000000/mozilla-beta/obj-l10n/dist/host/bin/mbsdiff', 'UPLOAD_EXTRA_FILES': 'update/linux-i686/fa/firefox-32.0b3-32.0b5.partial.mar update/linux-i686/fa/firefox-32.0b3-32.0b5.partial.mar.asc update/linux-i686/fa/firefox-32.0b4-32.0b5.partial.mar update/linux-i686/fa/firefox-32.0b4-32.0b5.partial.mar.asc', 'UPLOAD_HOST': 'stage.mozilla.org', 'POST_UPLOAD_CMD': 'post_upload.py -p firefox -n 1 -v 32.0b5 --release-to-candidates-dir --signed', 'MOZILLA_REV': 'FIREFOX_32_0b5_RELEASE', 'MOZ_PKG_PRETTYNAMES': '1', 'MOZ_PKG_VERSION': '32.0b5'}
command: output:
'../../dist/linux-i686/fa//firefox-32.0b5.checksums'

command: END (0.03 elapsed)

retry: Calling <bound method ReleaseSubmitterV3.run of <balrog.submitter.cli.ReleaseSubmitterV3 object at 0xeb2dd0>> with args: (), kwargs: {'buildID': '20140807212602', 'productName': 'Firefox', 'hashFunction': 'sha512', 'locale': 'fa', 'partialInfo': [{'previousVersion': '32.0b3', 'hash': '8e1e26564a612e8a838acb7cc8c483217f5b092911cd1e4fd8bd04a4a948ba32f665e92ec8c09f521896be04118dfe2f300433bd1130037a3af022fdcaa6b548', 'size': 3216757}, {'previousVersion': '32.0b4', 'hash': '86020c279db3a7c2f7ea630a06ace6219ae4b961ca737c4933e22f74bbab5a72d7be345371c298a81d37b11bd17eaabc69fdd159277aa98d18eba53041ee36d9', 'size': 4030174}], 'build_number': 1, 'extVersion': '32.0', 'appVersion': '32.0', 'platform': 'linux', 'version': '32.0b5', 'completeInfo': [{'hash': '8e7b32a796bb82260d13fbbcfc8e0aaacc947f35ffb97547530611829b42e40c2c0f1c8f6a6035fba535d4af577864e722db1dd9104e9fcfec59d413d10e88d6', 'size': 42079552}]}, attempt #1
Starting new HTTPS connection (1): aus4-admin.mozilla.org
Starting new HTTPS connection (2): aus4-admin.mozilla.org



.... locale "fa" now all done, now we move on to "ff" locale .....




retry: Calling <function downloadUpdateIgnore404 at 0xe8fc08> with args: ('stage.mozilla.org', 'firefox', '32.0b3', 1, 'linux', 'ff'), kwargs: {}, attempt #1
Downloading http://stage.mozilla.org/pub/mozilla.org/firefox/candidates/32.0b3-candidates/build1/update/linux-i686/ff/firefox-32.0b3.complete.mar to firefox-32.0b3.ff.complete.mar
retry: Calling <function downloadUpdateIgnore404 at 0xe8fc08> with args: ('stage.mozilla.org', 'firefox', '32.0b4', 1, 'linux', 'ff'), kwargs: {}, attempt #1
Downloading http://stage.mozilla.org/pub/mozilla.org/firefox/candidates/32.0b4-candidates/build1/update/linux-i686/ff/firefox-32.0b4.complete.mar to firefox-32.0b4.ff.complete.mar
retry: Calling <function mercurial at 0xe8b668> with args: ('https://hg.mozilla.org/releases/l10n/mozilla-beta/ff', 'l10n/ff'), kwargs: {}, attempt #1
Reporting hg version in use
command: START
command: hg -q version
command: cwd: .
command: output:
Mercurial Distributed SCM (version 2.5.4)
command: END (0.31s elapsed)

Checking if share extension works
command: START
command: hg help share
command: cwd: /builds/slave/rel-m-beta-lx_rpk_4-0000000000
command: END (0.35 elapsed)

Updating shared repo
Reporting hg version in use
command: START
command: hg -q version
command: cwd: .
command: output:
Mercurial Distributed SCM (version 2.5.4)
command: END (0.31s elapsed)

command: START
command: hg clone -U https://hg.mozilla.org/releases/l10n/mozilla-beta/ff /builds/hg-shared/releases/l10n/mozilla-beta/ff
command: cwd: /builds/slave/rel-m-beta-lx_rpk_4-0000000000
command: output:
warning: hg.mozilla.org certificate with fingerprint af:27:b9:34:47:4e:e5:98:01:f6:83:2b:51:c9:aa:d8:df:fb:1a:27 not verified (check hostfingerprints or web.cacerts config setting)
requesting all changes
adding changesets
adding manifests
adding file changes
added 242 changesets with 2326 changes to 489 files (+157 heads)
command: END (2.95s elapsed)

Trying to share /builds/hg-shared/releases/l10n/mozilla-beta/ff to /builds/slave/rel-m-beta-lx_rpk_4-0000000000/l10n/ff
command: START
command: hg share -U /builds/hg-shared/releases/l10n/mozilla-beta/ff /builds/slave/rel-m-beta-lx_rpk_4-0000000000/l10n/ff
command: cwd: /builds/slave/rel-m-beta-lx_rpk_4-0000000000
command: output:
command: END (0.35s elapsed)

command: START
command: hg branch
command: cwd: /builds/slave/rel-m-beta-lx_rpk_4-0000000000/l10n/ff
command: output:
default

command: END (0.36 elapsed)

command: START
command: hg update -C
command: cwd: /builds/slave/rel-m-beta-lx_rpk_4-0000000000/l10n/ff
command: output:
352 files updated, 0 files merged, 0 files removed, 0 files unresolved
command: END (0.64s elapsed)

command: START
command: hg parent --template {node|short}
command: cwd: /builds/slave/rel-m-beta-lx_rpk_4-0000000000/l10n/ff
command: output:
4955e27bab3d
command: END (0.41 elapsed)

command: START
command: hg update -C -r FIREFOX_32_0b5_RELEASE
command: cwd: l10n/ff
command: output:
abort: unknown revision 'FIREFOX_32_0b5_RELEASE'!
command: ERROR
Traceback (most recent call last):
  File "/builds/slave/rel-m-beta-lx_rpk_4-0000000000/scripts/lib/python/util/commands.py", line 47, in run_cmd
    return subprocess.check_call(cmd, **kwargs)
  File "/usr/lib64/python2.6/subprocess.py", line 502, in check_call
    raise CalledProcessError(retcode, cmd)
CalledProcessError: Command '['hg', 'update', '-C', '-r', 'FIREFOX_32_0b5_RELEASE']' returned non-zero exit status 255
command: END (0.49s elapsed)

Traceback (most recent call last):
  File "/builds/slave/rel-m-beta-lx_rpk_4-0000000000/scripts/scripts/l10n/create-release-repacks.py", line 123, in createRepacks
    buildNumber=buildNumber, stageServer=stageServer)
  File "/builds/slave/rel-m-beta-lx_rpk_4-0000000000/scripts/lib/python/build/l10n.py", line 107, in repackLocale
    update(localeDir, revision=revision)
  File "/builds/slave/rel-m-beta-lx_rpk_4-0000000000/scripts/lib/python/util/hg.py", line 120, in update
    run_cmd(cmd, cwd=dest)
  File "/builds/slave/rel-m-beta-lx_rpk_4-0000000000/scripts/lib/python/util/commands.py", line 47, in run_cmd
    return subprocess.check_call(cmd, **kwargs)
  File "/usr/lib64/python2.6/subprocess.py", line 502, in check_call
    raise CalledProcessError(retcode, cmd)
CalledProcessError: Command '['hg', 'update', '-C', '-r', 'FIREFOX_32_0b5_RELEASE']' returned non-zero exit status 255
retry: Calling <function downloadUpdateIgnore404 at 0xe8fc08> with args: ('stage.mozilla.org', 'firefox', '32.0b3', 1, 'linux', 'fi'), kwargs: {}, attempt #1


... Oh Noes!!!  - ff locale failed - what shall we do?! ...


... Continue with next locale (fi) ...



Downloading http://stage.mozilla.org/pub/mozilla.org/firefox/candidates/32.0b3-candidates/build1/update/linux-i686/fi/firefox-32.0b3.complete.mar to firefox-32.0b3.fi.complete.mar
retry: Calling <function downloadUpdateIgnore404 at 0xe8fc08> with args: ('stage.mozilla.org', 'firefox', '32.0b4', 1, 'linux', 'fi'), kwargs: {}, attempt #1
Downloading http://stage.mozilla.org/pub/mozilla.org/firefox/candidates/32.0b4-candidates/build1/update/linux-i686/fi/firefox-32.0b4.complete.mar to firefox-32.0b4.fi.complete.mar
retry: Calling <function mercurial at 0xe8b668> with args: ('https://hg.mozilla.org/releases/l10n/mozilla-beta/fi', 'l10n/fi'), kwargs: {}, attempt #1
Reporting hg version in use
command: START
command: hg -q version
command: cwd: .
command: output:
Mercurial Distributed SCM (version 2.5.4)
command: END (0.31s elapsed)



As you see above, when ff fails, it moves on to the next locale (fi) without retrying ff.

If we retry here, we should avoid intermittent issues.

It is also worth noting hg -q version gets called twice per locale, even though it is same for every locale, and also hg help share gets called for every locale. Not such a bad thing, but thought I'd mention it (catlee mentioned this is probably because it is using hgtool).
So I've written the attached patch, but currently don't have a working test environment to test it in. Attaching in case anyone else has time/interest to look at this, otherwise I'll pick this up later when I have time again.

Ben, in principle does this look like a reasonable change to you?
Flags: needinfo?(bhearsum)
Comment on attachment 8470084 [details] [diff] [review]
bug1050808_tools_v1.patch

Review of attachment 8470084 [details] [diff] [review]:
-----------------------------------------------------------------

As I mentioned on Vidyo, I don't think we should be retrying this overall method. Lots of things that happen in this function (including nearly all of the "make" calls) can fail due to reasons that a retry will not help with - which just ends up causing unnecessary churn on the slave.

Instead, I would suggest retrying operations in the function itself that depend on the network and such: https://github.com/mozilla/build-tools/blob/master/lib/python/build/l10n.py#L99

It might be a little more complex than that though, because the errors in comment #0 are an "hg update" call failing, after a clone or pull succeeded. A proper retry of this probably involves wrapping the clone+update in a single function, and retrying that instead.
Attachment #8470084 - Flags: feedback-
Flags: needinfo?(bhearsum)
Whiteboard: [good first bug][lang=python]
(In reply to Ben Hearsum [:bhearsum] from comment #2)
> Comment on attachment 8470084 [details] [diff] [review]
> bug1050808_tools_v1.patch
> 
> Review of attachment 8470084 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> As I mentioned on Vidyo, I don't think we should be retrying this overall
> method. Lots of things that happen in this function (including nearly all of
> the "make" calls) can fail due to reasons that a retry will not help with -
> which just ends up causing unnecessary churn on the slave.
> 
> Instead, I would suggest retrying operations in the function itself that
> depend on the network and such:
> https://github.com/mozilla/build-tools/blob/master/lib/python/build/l10n.
> py#L99
> 
> It might be a little more complex than that though, because the errors in
> comment #0 are an "hg update" call failing, after a clone or pull succeeded.
> A proper retry of this probably involves wrapping the clone+update in a
> single function, and retrying that instead.

I agree that it is inefficient to retry the entire repack, but also since bug 740142 should be landing soon, it is probably not worth the investment of effort to refactor in order to make this more efficient, since when 740142 lands, this will no longer be needed.

It is worth noting though, at the moment, if a repack fails, we manually trigger the entire repack code (including make calls etc) and so if we do land this fix, we do not lose efficiencies, since we already have this overhead already, plus the overhead of manual intervention, and the risk of human error. Therefore based on both these points (740142 landing soon and current inefficiencies) I would propose it may still be worth landing this fix, rather than investing in effort to refactor this fix to be more efficient.
Flags: needinfo?(bhearsum)
(In reply to Pete Moore [:pete][:pmoore] from comment #3)
> (In reply to Ben Hearsum [:bhearsum] from comment #2)
> > Comment on attachment 8470084 [details] [diff] [review]
> > bug1050808_tools_v1.patch
> > 
> > Review of attachment 8470084 [details] [diff] [review]:
> > -----------------------------------------------------------------
> > 
> > As I mentioned on Vidyo, I don't think we should be retrying this overall
> > method. Lots of things that happen in this function (including nearly all of
> > the "make" calls) can fail due to reasons that a retry will not help with -
> > which just ends up causing unnecessary churn on the slave.
> > 
> > Instead, I would suggest retrying operations in the function itself that
> > depend on the network and such:
> > https://github.com/mozilla/build-tools/blob/master/lib/python/build/l10n.
> > py#L99
> > 
> > It might be a little more complex than that though, because the errors in
> > comment #0 are an "hg update" call failing, after a clone or pull succeeded.
> > A proper retry of this probably involves wrapping the clone+update in a
> > single function, and retrying that instead.
> 
> I agree that it is inefficient to retry the entire repack, but also since
> bug 740142 should be landing soon, it is probably not worth the investment
> of effort to refactor in order to make this more efficient, since when
> 740142 lands, this will no longer be needed.
> 
> It is worth noting though, at the moment, if a repack fails, we manually
> trigger the entire repack code (including make calls etc) and so if we do
> land this fix, we do not lose efficiencies, since we already have this
> overhead already, plus the overhead of manual intervention, and the risk of
> human error. Therefore based on both these points (740142 landing soon and
> current inefficiencies) I would propose it may still be worth landing this
> fix, rather than investing in effort to refactor this fix to be more
> efficient.

One of the things I'm concerned about here is unnecessary churn. In many cases when repacks fail, the problem cannot be fixed by retrying. Just last night we had issues with a translation that caused failures. We also hit issues that cause all locales to fail in ways where a retry doesn't help sometimes. Retrying in cases like this is a waste of machine time, obviously, and retrying for them also means it takes longer (5x longer, if you use the default number of attempts) for a human to notice and fix the failures. With repacks being in the critical path of a release, it's a real concern.

If it's not worthwhile refactoring to retry more efficiently, I think we're better off without this, sorry :(.
Flags: needinfo?(bhearsum)
(In reply to Ben Hearsum [:bhearsum] from comment #4) 
> One of the things I'm concerned about here is unnecessary churn. In many
> cases when repacks fail, the problem cannot be fixed by retrying. Just last
> night we had issues with a translation that caused failures. We also hit
> issues that cause all locales to fail in ways where a retry doesn't help
> sometimes. Retrying in cases like this is a waste of machine time,
> obviously, and retrying for them also means it takes longer (5x longer, if
> you use the default number of attempts) for a human to notice and fix the
> failures. With repacks being in the critical path of a release, it's a real
> concern.
> 
> If it's not worthwhile refactoring to retry more efficiently, I think we're
> better off without this, sorry :(.

That's fair, but I also think the hg operations still fail frequently enough that wrapping them in retry is worth doing. 

Let's get this added to the releaseduty queue.
Whiteboard: [good first bug][lang=python] → [good first bug][lang=python][releaseduty]
Hello everybody
I am new to mozilla development and I would like to start with this bug. If you could show me the way around I would like to solve it.

Cheers
Hi carocad,

Thanks for you interest in contributing to Mozilla!

Unfortunately, it looks like the code related to this bug is going away soon.

Would you be interested in helping out with bug 1048822 or bug 1072026 instead?
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → WONTFIX
Component: Tools → General
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: