Closed Bug 1123025 Opened 9 years ago Closed 9 years ago

b2g emulator nightlies (sometimes?) use a test package from a previous nightly

Categories

(Release Engineering :: General, defect, P2)

ARM
Gonk (Firefox OS)
defect

Tracking


RESOLVED FIXED
Tracking Status
firefox39 --- fixed
b2g-v1.4 --- fixed
b2g-v2.0 --- fixed
b2g-v2.1 --- fixed
b2g-v2.1S --- fixed
b2g-v2.2 --- fixed
b2g-master --- fixed

People

(Reporter: philor, Assigned: nthomas)

Attachments

(2 files)

You'll say "but look at the hgtool output!," and I say "nay, nay, look at the clear results that the tests show."

In https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&revision=35df417b93a7 we built an emulator nightly, triggered at 4:02 (why not 4:20, given the result?), which claims to have been built on https://hg.mozilla.org/mozilla-central/rev/35df417b93a7 but, given the result of the B2G ICS Emulator opt mochitest-9, was actually built on some revision in the middle of https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&revision=b86864fd9d60

I should have realized it yesterday, because in https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&revision=c9162436444e we built a 4:02 emulator nightly that's totally 4:20, because the emulator reftest-20 results unquestionably show that it was actually built on a revision somewhere in https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&revision=47b586de5661 between https://hg.mozilla.org/mozilla-central/rev/daf8243cd190 which caused that failure and https://hg.mozilla.org/mozilla-central/rev/0c2393315416 which marked that test as random on b2g.

If this isn't the result of something obvious I'm not seeing, which can be quickly fixed, please disable the sendchange so they stop running tests.

No, wait, these builds are randomly meaningless, leading QA to say that fish are rocks. If they can't be quickly, like Monday quickly, fixed, please shut off emulator nightlies until they actually run on the tip revision in a push, and say which revision they actually ran on. A build which lies about what it is is worse than useless, it's actively harmful.
Catlee, can you or someone else investigate this? Lean toward scoping the issue first, if an aha moment isn't apparent. I can see the urgency based on the following IRC snippet:

[21:32:39]	philor	configure: error: Cannot find an llvm-config binary for building a clang plugin
[23:15:37]	Callek	philor: was the above in the wrong channel? if not which job did we break?
[23:16:39]	philor	Callek: Mac static analysis, and it's entirely possible it was ehsan's in-tree part that broke it, or that it's more than just the b2g nightly builds which build on some random cset other than the one on which they claim to have built
[23:16:54]	philor	https://treeherder.mozilla.org/#/jobs?repo=b2g-inbound&revision=cff11d5366a7
[23:17:11]	-->|	gbrown (gbrown@moz-gahfeo.cg.shawcable.net) has joined #releng
[23:17:13]	Callek	thanks, I'll make sure it gets looked into within the next 24 hours
[23:17:22]	philor	which is failing like a twig where I didn't merge what he landed on m-c, but well above it
[23:17:54]	philor	and other trees are not failing
[23:18:36]	philor	and other pushes on that same tree, but above that one are not failing
[23:22:39]	philor	wonder how many "still failing, that backout needed a clobber" "that absolutely shouldn't have needed a clobber" things would be explained by bug 1123025 not being restricted to just b2g emulator nightlies
Flags: needinfo?(catlee)
Cf. bug 1123342 comment 6 for the static-analysis issue mentioned above, to keep the issues separate.
philor, reading comment 0 and the linked logs, neither catlee nor I can derive the conclusion you did. Can you provide a bit more detail on how you came to it?

Ryan, have you or other sheriffs seen this issue as well, and could you provide more details?

Thank You
Flags: needinfo?(ryanvm)
Flags: needinfo?(philringnalda)
Flags: needinfo?(catlee)
I have no context on this at all. Better to let philor explain what he means.
Flags: needinfo?(ryanvm)
If b2g emulator reftest-20 hits a "test-image-layers-multiple-displayitem.html | failed reftest-no-paint" then you know that you have built on a revision at or after https://hg.mozilla.org/mozilla-central/rev/daf8243cd190 which introduced that problem.

If b2g emulator reftest-20 reports that as being a failed test, then you know that you have built on a revision before https://hg.mozilla.org/mozilla-central/rev/0c2393315416, because that revision marked the test as random-if(B2G).
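The bracketing argument in the two paragraphs above fits in a tiny function. This is a hypothetical sketch: only the two revision hashes come from this comment, the function name is illustrative, and it assumes the code and the reftests.list manifest come from the same checkout, which is exactly the assumption the later comments call into question.

```python
# Hypothetical sketch of the bracketing argument. Only the revision hashes come
# from the bug; the function name and its boolean input are illustrative.
INTRODUCED = "daf8243cd190"     # made the test hit "failed reftest-no-paint"
MARKED_RANDOM = "0c2393315416"  # annotated the test random-if(B2G)


def implied_range(reported_as_failure: bool) -> str:
    """Bound the revision behind a reftest-20 run that hit the no-paint problem.

    Hitting the problem at all means the code is at or after INTRODUCED.
    Reporting it as a plain failure, rather than "(EXPECTED RANDOM)", means the
    manifest in use predates MARKED_RANDOM. Assumes code and manifest match.
    """
    if reported_as_failure:
        return f"at or after {INTRODUCED}, before {MARKED_RANDOM}"
    return f"at or after {MARKED_RANDOM}"


if __name__ == "__main__":
    # The c9162436444e nightly hit the problem and reported it as a failure.
    print(implied_range(True))  # at or after daf8243cd190, before 0c2393315416
```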

Both of those revisions were in the middle of a merge to mozilla-central, two or three merges before https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&revision=c9162436444e, where we built a nightly. (Confusingly, we started it before the on-push build, so it finished first and the very first reftest-20 is from it.) That nightly both had the test problem and reported it as a failure, so it appears to have been built from a revision in the middle of a push several pushes before the one it claims. We also built an on-push build there, which behaves correctly for the revision it claims, reporting that test as "(EXPECTED RANDOM)".

The other one is the same sort of thing, except that it apparently built in the middle of a bunch of csets that were pushed to mozilla-inbound all in one push, so I don't have m-i failures to point at and say "see, it has to have been after this one and before that one." Someone with too much time on their hands and too little desire to save resources could push all heycam's stuff from https://hg.mozilla.org/mozilla-central/pushloghtml?changeset=b86864fd9d60 to Try one cset at a time, to find out exactly where in those patches it would be possible to build and get layout/inspector/tests/test_bug1006595.html to expect the 'padding' property to have 8 subproperties, but for it to only actually have 4.

The easy out would be to claim that this means b2g doesn't properly rebuild tests, since for the reftest case all it takes to fail is to have an outdated reftests.list manifest file, and probably for the mochitest case it's just that the test files didn't have updated code while the actual code had been changed.

But, don't we clobber nightlies? Where would a nightly be getting an outdated reftests.list to use, other than by having a repo which was actually sitting at a revision before the one where it should be?
Flags: needinfo?(philringnalda)
So the nightly builds upload the emulator to http://pvtbuilds.pvt.build.mozilla.org/pub/mozilla.org/b2g/nightly/mozilla-central-emulator/latest/emulator.tar.gz

Is it possible that multiple things are uploading there at the same time?
Mmm, nice one! I'd bet on it being the tests.zip that gets overwritten, though. The failures fit "built the right thing, but tested it with tests from several pushes before" better than the reverse: a wrong build requires things which should not exist (builds from the middle of pushes), while wrong tests work fine with test zips that should exist but are simply from a prior build.
Or, rather than being overwritten, the new tests.zip simply not being served yet would fit just as well.
Severity: critical → normal
Summary: b2g emulator nightlies are built on a random previous Gecko revision from the middle of a previous push, not the one they claim → b2g emulator nightlies (sometimes?) use a test package from a previous nightly
bug 1134966 seems like a failure that would result from using old code and new tests.

Either problem seems like something that could be caused by running tests on build and test packages downloaded from a directory called "latest", which is what seems to be happening for tests run on nightlies (also see comments in that bug).
Found the cause of this, patch incoming.
Assignee: nobody → nthomas
Priority: -- → P2
Looks like an error in this changeset: http://hg.mozilla.org/build/mozharness/rev/552c85b84fe9

The emulator builds define upload_remote_nightly_path twice, instead of setting upload_remote_nightly_symlink like everything else. Once we fix this up we'll start doing sendchanges using dated urls, where the content isn't changing.

We'll need to log on to pvtbuilds and remove all the latest dirs after this lands and before another nightly comes along.
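A minimal Python sketch of the dated-directory-plus-symlink layout, assuming a local POSIX filesystem. It is not mozharness's upload code (which pushes to a remote host); the function name and the dated directory names are hypothetical. The point is that the contents behind any given path never change once uploaded, and only the `latest` pointer moves.

```python
import os
import tempfile


def publish_nightly(nightly_root: str, dated_name: str) -> str:
    """Repoint nightly_root/latest at the dated upload directory dated_name.

    Files are uploaded into the dated directory and never rewritten; only the
    'latest' symlink moves. The new link is created under a temporary name and
    renamed over 'latest', which is atomic on POSIX filesystems, so a reader
    sees either the old target or the new one, never a half-finished upload.
    """
    latest = os.path.join(nightly_root, "latest")
    tmp_link = os.path.join(nightly_root, f".latest.tmp.{os.getpid()}")
    # Relative target, so the link still resolves if the tree is mounted elsewhere.
    os.symlink(dated_name, tmp_link)
    try:
        # rename(2) replaces an existing symlink in one step, but fails with
        # IsADirectoryError if 'latest' is still a real directory.
        os.replace(tmp_link, latest)
    except OSError:
        os.unlink(tmp_link)  # don't leave the temporary link behind
        raise
    return os.readlink(latest)


if __name__ == "__main__":
    root = tempfile.mkdtemp()
    for name in ("2015-03-09-16-00-00", "2015-03-10-04-00-00"):  # hypothetical dirs
        os.makedirs(os.path.join(root, name))
    publish_nightly(root, "2015-03-09-16-00-00")
    print(publish_nightly(root, "2015-03-10-04-00-00"))  # 2015-03-10-04-00-00
```

Test jobs would then be pointed at the dated URL rather than at `latest`, so a later nightly repointing the link cannot change what an in-flight test run downloads.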
Attachment #8571689 - Flags: review?(jlund)
Comment on attachment 8571689 [details] [diff] [review]
[mozharness] Symlink latest instead of writing to it directly

Review of attachment 8571689 [details] [diff] [review]:
-----------------------------------------------------------------

nice catch
Attachment #8571689 - Flags: review?(jlund) → review+
Comment on attachment 8571689 [details] [diff] [review]
[mozharness] Symlink latest instead of writing to it directly

Landed on default:
https://hg.mozilla.org/build/mozharness/rev/9829179fd77e

This won't go into production until testing/mozharness/mozharness.json is bumped, but the reconfig script will claim otherwise.
Attachment #8571689 - Flags: checked-in+
Blocks: 1140864
Are there any plans to integrate this kind of change with reconfigs? Say, landing on mozilla-inbound.
Attachment #8574454 - Flags: review?(jlund)
Comment on attachment 8574454 [details] [diff] [review]
[gecko] Bump in-tree mozharness rev

Review of attachment 8574454 [details] [diff] [review]:
-----------------------------------------------------------------

maybe releng (at the least) shouldn't have to get reviews for this but instead just ping buildduty when they are landing such a change
Attachment #8574454 - Flags: review?(jlund) → review+
(In reply to Nick Thomas [:nthomas] from comment #15)
> Created attachment 8574454 [details] [diff] [review]
> [gecko] Bump in-tree mozharness rev
> 
> Are there any plans to integrate this kind of change with reconfigs ? Say
> landing on mozilla-inbound

that might be a better way to do it so we don't end up with a queue of changes landing at once every few days (particularly on m-c).

coop: rather than removing the default -> prod merge for mh during reconfigs, maybe we should do something like nick suggests ^ ? Will discuss at tomorrow's buildduty mtg.
Flags: needinfo?(coop)
Comment on attachment 8574454 [details] [diff] [review]
[gecko] Bump in-tree mozharness rev

https://hg.mozilla.org/integration/b2g-inbound/rev/738109e4e80e

Investigating what cleanup will be needed and when.
Attachment #8574454 - Flags: checked-in+
FYI, my landing got superseded by https://hg.mozilla.org/integration/mozilla-inbound/rev/1d8fe559384e, sheriffs are going to take that when the merge fails.

Cute side-effect of this bug is that there is no archive of emulator nightlies, we've been overwriting the latest directory twice a day.
A test nightly on m-c worked fine. I'd moved nightly/<branch>-<platform>/latest/ to nightly/<branch>-<platform>/2015/03/2015-03-09-16-XX-YY for each combination except ash, without bothering to make XX-YY exactly right for each build.

On mozilla-b2g30_v1_4 we're hitting Gu and Gip failures on linux64 desktop:
INFO -  Exception: TypeError: redeclaration of variable files
INFO -  @file:///builds/slave/test/gaia/build/webapp-zip.js:365:NaN
INFO -  @file:///builds/slave/test/gaia/build/multilocale.js:7:19
INFO -  @file:///builds/slave/test/gaia/build/utils-xpc.js:3:21
INFO -  @file:///builds/slave/test/gaia/build/utils.js:10:11
INFO -  @file:///builds/slave/test/gaia/build/preferences.js:4:13
INFO -  @-e:1:7

http://ftp.mozilla.org/pub/mozilla.org/b2g/tinderbox-builds/mozilla-b2g30_v1_4-linux64_gecko/1425961400/mozilla-b2g30_v1_4_ubuntu64_vm-b2gdt_test-gaia-unit-bm115-tests1-linux64-build2.txt.gz
http://ftp.mozilla.org/pub/mozilla.org/b2g/nightly/2015/03/2015-03-10-00-02-04-mozilla-b2g30_v1_4/mozilla-b2g30_v1_4_ubuntu64_vm-b2gdt_test-gaia-ui-test-bm114-tests1-linux64-build5.txt.gz

Probably from the newer xulrunner we picked up by moving from mozharness f2c783118c6f (~1 month old). Just backed out, and undid the changes on pvtbuilds.

All done here.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Status: REOPENED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
(In reply to Jordan Lund (:jlund) from comment #17)
> coop: rather than removing the default -> prod merge for mh during
> reconfigs, maybe we should do something like nick suggests ^ ? Will discuss
> at tomorrow's buildduty mtg'

Closing the loop here, we decided *not* to pursue this, considering that in-tree mozharness will be a thing soon enough. We opted instead to use clearer bug comments for mozharness production updates made during reconfigs.
Flags: needinfo?(coop)
(In reply to Nick Thomas [:nthomas] from comment #21)
> https://hg.mozilla.org/releases/mozilla-b2g30_v1_4/rev/a244a27343a5

RyanVM reports this was backed out due to unrelated bustage; it has now relanded and is causing errors:
  rm: cannot remove `/pub/mozilla.org/b2g/nightly/mozilla-b2g30_v1_4-emulator/latest': Is a directory

I've removed /pub/mozilla.org/b2g/nightly/mozilla-b2g30_v1_4*/latest to fix it up.
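The error above is what a plain `rm` prints when asked to delete a directory: once the fix lands, `latest` is expected to be a symlink, but on branches that had not been cleaned up it was still a real directory. Here the one-time cleanup was done by hand (and on mozilla-central by moving `latest` into a dated directory). A hypothetical Python sketch of that migration, assuming local filesystem access; the function name and the dated name are illustrative, not part of mozharness.

```python
import os
import shutil
import tempfile
from typing import Optional


def migrate_legacy_latest(nightly_root: str, archive_name: Optional[str] = None) -> None:
    """Clear a pre-fix 'latest' directory so a symlink can take its place.

    If 'latest' is already a symlink, or missing, there is nothing to do. If it
    is a real directory left over from uploading into 'latest' directly, either
    rename it to archive_name (keeping the old build as a dated snapshot) or
    delete it outright.
    """
    latest = os.path.join(nightly_root, "latest")
    if os.path.islink(latest) or not os.path.exists(latest):
        return
    if not os.path.isdir(latest):
        raise RuntimeError(f"{latest} is neither a symlink nor a directory")
    if archive_name is not None:
        os.rename(latest, os.path.join(nightly_root, archive_name))
    else:
        shutil.rmtree(latest)


if __name__ == "__main__":
    root = tempfile.mkdtemp()
    os.makedirs(os.path.join(root, "latest"))
    migrate_legacy_latest(root, "2015-03-09-16-00-00")  # hypothetical dated name
    print(sorted(os.listdir(root)))  # ['2015-03-09-16-00-00']
```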
Component: General Automation → General