Closed Bug 1316564 Opened 8 years ago Closed 7 years ago

[tier-3] Intermittent test_direct_update.py TestDirectUpdate.test_update, test_fallback_update.py TestFallbackUpdate.test_update | AssertionError: Additional update found due to watershed release 51.0a2

Categories

(Testing :: Firefox UI Tests, defect)

Version: 3
Type: defect
Priority: Not set
Severity: normal

Tracking

(firefox-esr52 fixed, firefox53+ wontfix, firefox54 fixed, firefox55 fixed)

RESOLVED FIXED
mozilla55

People

(Reporter: intermittent-bug-filer, Assigned: whimboo)

References

Details

(Keywords: intermittent-failure)

Attachments

(1 obsolete file)

I think that this is a side effect of bug 1314550, now for nightly and aurora builds. The underlying reason is probably a second nightly build on that channel within a day, combined with the delayed offering of partial updates due to slow processing of funsize tasks (bug 1284516).

There are two things to think about:

1. We should make the failure message more general, which means that we do not always talk about a watershed but just say that another update got offered. We can then add the version and build ID of the currently updated version so we can at least more easily verify the situation.

2. Bug 1285340, which covers a different expected target build ID, could be related here or even be the underlying issue. Given that we moved the check for further updates up by some lines, we do it first now. See https://hg.mozilla.org/mozilla-central/rev/fa64968dc3a1. But I'm not fully sure about that theory, because even when our tests run late, Balrog should immediately offer us an update to the most recent version. Maybe those two affected update tests were running at a time when the new updates were close to being ready, so we see an overlap.

I wonder if we could simply ignore further updates in case a target version and build ID have been given and those match the ones of the updated build. For beta and release builds it would mean that for testing watersheds we no longer get the failure which shows us that other updates are available. But I doubt that this would be a problem, and we can trigger a second ondemand run for watersheds only. With that we would even have green results - failures are always strange.
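A minimal sketch of what such a guard could look like, combined with the more general failure message from point 1 (all names here are hypothetical, not the real harness API):

```python
# Hypothetical sketch: only treat an additionally offered update as a failure
# when the build we just updated to is not the explicitly requested target.
def check_for_additional_update(updated_build, offered_update,
                                target_version=None, target_buildid=None):
    reached_target = (target_version and target_buildid and
                      updated_build['version'] == target_version and
                      updated_build['buildid'] == target_buildid)
    if offered_update and not reached_target:
        raise AssertionError(
            'Additional update offered ({} / {}) after updating to {} / {}'.format(
                offered_update.get('version'), offered_update.get('buildid'),
                updated_build['version'], updated_build['buildid']))
```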

Rail and Florin, what do you think?
Flags: needinfo?(rail)
Flags: needinfo?(florin.mezei)
(In reply to Henrik Skupin (:whimboo) from comment #1)
> 
> I wonder if we could simply ignore further updates in case of a target
> version and buildid have been given and those match the ones for the updated
> builds. For beta and release builds it would mean that for testing
> watersheds we do not get the failure anymore, which shows us that other
> updates are available. But I doubt that this should be a problem, and we can
> trigger a second ondemand run for watersheds only. With that we would have
> green results even - failures are always strange.
> 
> Rail and Florin, what do you think?

I agree with your statement here - I think it's fine to ignore further updates.
Flags: needinfo?(florin.mezei)
Sounds sane to me. I think with this approach you can actually test the watersheds as well. We would just need to schedule multiple tests. For example, if we have a watershed at 47.0.2 for Windows, then we can schedule two tests: 1) 46.0 -> 47.0.2, and 2) 47.0.2 -> 49.0.2 (latest).
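For illustration only, that scheduling could be expressed as a list of source/target pairs, each run as its own ondemand job (the actual scheduling lives in mozmill-ci, not in this sketch):

```python
# Hypothetical update paths for a watershed at 47.0.2 on Windows:
# each pair is scheduled as a separate ondemand update test run.
UPDATE_PATHS = [
    ('46.0', '47.0.2'),    # up to the watershed
    ('47.0.2', '49.0.2'),  # from the watershed to the latest release
]

for source, target in UPDATE_PATHS:
    print('schedule update test: {} -> {}'.format(source, target))
```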
Flags: needinfo?(rail)
Summary: Intermittent test_fallback_update.py TestFallbackUpdate.test_update | AssertionError: Additional update found due to watershed release 51.0a2 → Intermittent test_direct_update.py TestDirectUpdate.test_update, test_fallback_update.py TestFallbackUpdate.test_update | AssertionError: Additional update found due to watershed release 51.0a2
We have seen a large increase in those failures lately, and they even made it into beta now for Firefox 53.0b7:

https://treeherder.mozilla.org/#/jobs?repo=mozilla-beta&revision=46d5fe92c82d&group_state=expanded&filter-tier=3&filter-searchStr=Fxup-beta-cdntest(

Specifically the fallback update tests are affected here. As the log shows, the pre and post builds are identical, so an upgrade did not really happen:

07:18:32     INFO - [{'build_post': {'buildid': u'20170321004003',
07:18:32     INFO -                  'channel': u'aurora',
07:18:32     INFO -                  'disabled_addons': None,
07:18:32     INFO -                  'locale': u'de',
07:18:32     INFO -                  'mar_channels': set(['firefox-mozilla-aurora']),
07:18:32     INFO -                  'update_snippet': '<?xml version="1.0"?>\n<updates>\n    <update type="minor" displayVersion="54.0a2" appVersion="54.0a2" platformVersion="54.0a2" buildID="20170324004022">\n        <patch type="complete" URL="https://mozilla-nightly-updates.s3.amazonaws.com/mozilla-aurora/20170324004022/Firefox-mozilla-aurora-54.0a2-win32-de.complete.mar?versionId=7PqVcXh7_Mnu6EFXCoTtjcXlaPqpW7tj" hashFunction="sha512" hashValue="25b5f09c3ef977dacb6a67a3968b83031aa7a08df36ff1e6dccfb2d9f0a1fd026cdaf3556855ffe10cd9d45ac21e5386bfbb3fd37e5b5f074e42a2e089e70292" size="53720874"/>\n        <patch type="partial" URL="https://mozilla-nightly-updates.s3.amazonaws.com/mozilla-aurora/20170324004022/Firefox-mozilla-aurora-54.0a2-win32-de-20170321004003-20170324004022.partial.mar?versionId=MEsbg4wmbbelCIvFBUnZB6NMbz6reZda" hashFunction="sha512" hashValue="7d126dcaedce229d7a03230f80e048b9dc091a1a29fd6425618d01959e896c0a49387f3b1cea8c287f7d5e7815d4b7776de881e608c6af4b46dd781085bfbd83" size="11158787"/>\n    </update>\n</updates>',
07:18:32     INFO -                  'update_url': u'https://aus5.mozilla.org/update/6/Firefox/54.0a2/20170321004003/WINNT_x86-msvc-x64/de/aurora/Windows_NT%206.3.0.0%20(x64)(noBug1296630v1)(nowebsense)/SSE3/default/default/update.xml?force=1',
07:18:32     INFO -                  'user_agent': u'Mozilla/5.0 (Windows NT 6.3; WOW64; rv:54.0) Gecko/20100101 Firefox/54.0',
07:18:32     INFO -                  'version': u'54.0a2'},
07:18:32     INFO -   'build_pre': {'buildid': u'20170321004003',
07:18:32     INFO -                 'channel': u'aurora',
07:18:32     INFO -                 'disabled_addons': None,
07:18:32     INFO -                 'locale': u'de',
07:18:32     INFO -                 'mar_channels': set(['firefox-mozilla-aurora']),
07:18:32     INFO -                 'update_snippet': '<?xml version="1.0"?>\n<updates>\n    <update type="minor" displayVersion="54.0a2" appVersion="54.0a2" platformVersion="54.0a2" buildID="20170324004022">\n        <patch type="complete" URL="https://mozilla-nightly-updates.s3.amazonaws.com/mozilla-aurora/20170324004022/Firefox-mozilla-aurora-54.0a2-win32-de.complete.mar?versionId=7PqVcXh7_Mnu6EFXCoTtjcXlaPqpW7tj" hashFunction="sha512" hashValue="25b5f09c3ef977dacb6a67a3968b83031aa7a08df36ff1e6dccfb2d9f0a1fd026cdaf3556855ffe10cd9d45ac21e5386bfbb3fd37e5b5f074e42a2e089e70292" size="53720874"/>\n        <patch type="partial" URL="https://mozilla-nightly-updates.s3.amazonaws.com/mozilla-aurora/20170324004022/Firefox-mozilla-aurora-54.0a2-win32-de-20170321004003-20170324004022.partial.mar?versionId=MEsbg4wmbbelCIvFBUnZB6NMbz6reZda" hashFunction="sha512" hashValue="7d126dcaedce229d7a03230f80e048b9dc091a1a29fd6425618d01959e896c0a49387f3b1cea8c287f7d5e7815d4b7776de881e608c6af4b46dd781085bfbd83" size="11158787"/>\n    </update>\n</updates>',
07:18:32     INFO -                 'update_url': u'https://aus5.mozilla.org/update/6/Firefox/54.0a2/20170321004003/WINNT_x86-msvc-x64/de/aurora/Windows_NT%206.3.0.0%20(x64)(noBug1296630v1)(nowebsense)/SSE3/default/default/update.xml?force=1',
07:18:32     INFO -                 'user_agent': u'Mozilla/5.0 (Windows NT 6.3; WOW64; rv:54.0) Gecko/20100101 Firefox/54.0',
07:18:32     INFO -                 'version': u'54.0a2'},
07:18:32     INFO -   'fallback': True,
07:18:32     INFO -   'patch': {'buildid': u'20170324004022',
07:18:32     INFO -             'channel': u'aurora',
07:18:32     INFO -             'download_duration': 2.213,
07:18:32     INFO -             'is_complete': True,
07:18:32     INFO -             'size': 53720874,
07:18:32     INFO -             'type': u'minor',
07:18:32     INFO -             'url_mirror': u'https://mozilla-nightly-updates.s3.amazonaws.com/mozilla-aurora/20170324004022/Firefox-mozilla-aurora-54.0a2-win32-de.complete.mar?versionId=7PqVcXh7_Mnu6EFXCoTtjcXlaPqpW7tj',
07:18:32     INFO -             'version': u'54.0a2'},
07:18:32     INFO -   'success': False,
07:18:32     INFO -   'update_log': ['Performing a staged update',
07:18:32     INFO -                  'PATCH DIRECTORY C:\\Users\\mozauto\\AppData\\Local\\Mozilla\\updates\\5FC44D6E9F1A2441\\updates\\0',
07:18:32     INFO -                  'INSTALLATION DIRECTORY c:\\jenkins\\workspace\\mozilla-aurora_update\\build\\application.copy',
07:18:32     INFO -                  'WORKING DIRECTORY c:\\jenkins\\workspace\\mozilla-aurora_update\\build\\application.copy\\updated',
07:18:32     INFO -                  'Update already in progress! Exiting']}]

So it looks like there is an active update already in progress.
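For reference, the buildID and patch details being compared come straight from the update_snippet captured in the log; a quick standard-library way to inspect such a snippet (the file name is just a placeholder):

```python
import xml.etree.ElementTree as ET

# Load an update snippet like the one logged above (placeholder file name).
snippet = open('update.xml').read()

update = ET.fromstring(snippet).find('update')
print('offered buildID:', update.get('buildID'))
for patch in update.findall('patch'):
    print(patch.get('type'), patch.get('size'), patch.get('URL'))
```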

All this affects only Windows, not Linux and OS X. Matt, do you know of any changes to the updater on Windows which might have caused those failures? It doesn't happen every time, but the failures now occur more often than passing runs.
Flags: needinfo?(mhowell)
Maybe this is not related to any changes to the updater but to bug 1333014. That would make it a different test failure from this one. I will file a new bug once it's clearer to me.
Flags: needinfo?(mhowell)
Just for the record, I did take a look in the logs and I don't see any substantial Windows-specific changes in the updater since August (bug 1246972). The only recent change was a small fix for a moderate security issue (bug 1321814), but if that were causing problems I really don't think they would look like this failure.
For the record, the issue started showing up today. We didn't encounter this problem at all yesterday for 52.0.2 and 52.0.2 ESR when testing on esr-localtest or release-localtest with these same versions. 53.0b7 and 52.0.2 ESR have had some serious problems today on beta-cdntest and esr-cdntest.

Note that the problem was also seen on release today, but it seemed to have a diminished impact, as the tests passed on the next re-runs.
This is really very strange behavior. For now I will hijack this bug because it might still be related to what we partially see for aurora and central too.

So we download the partial update and make it invalid. As a result, after a restart Firefox complains that it cannot be applied and that a complete update has to be downloaded. When I click `Next` as usual, there is no download screen but the update is already there. In this case I see no logging info from the updater at all about this download. Which code downloads it? Clicking restart afterward to apply the update restarts Firefox, but without updating it.

What I also noticed is that two separate update staging directories are getting created. One is in `%appdata%/local/Mozilla/update/%hash%`, which is empty, and there is another one in `%appdata%/local/Mozilla/Firefox/firefox/update` which contains the downloaded MAR files.

Something else I noticed is that when the restart happens and the update should be applied, last-update.log afterward says that it tried to apply a partial update (UPDATE TYPE partial), and at the end it says it succeeded. Whatever we do here is clearly wrong.
Flags: needinfo?(mhowell)
Btw, I also tried this manually now and can reproduce it. We definitely do not upgrade Firefox after a restart in such a situation.
Ok, so in the manual case I might know what's going on. It may also apply to the update tests.

After downloading the partial update we apply the update immediately. A restart of Firefox would simply swap the versions and immediately use the new one. When I now mark the update.status file as "failed:6" AFTER the update has been applied, Firefox still knows about the updated version after a restart, but doesn't use it. As such the files remain in that subfolder. Trying to apply the complete MAR now fails because there is still an update process active.
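For context, the "mark it as failed" step boils down to overwriting update.status once the partial has been staged; roughly like this (a sketch only, with the directory passed in as a placeholder and the status string as quoted in this bug, not copied from the harness source):

```python
import os

def invalidate_staged_update(updates_dir):
    # Overwrite update.status so Firefox believes applying the partial failed
    # and falls back to downloading the complete MAR after the next restart.
    with open(os.path.join(updates_dir, 'update.status'), 'w') as f:
        f.write('failed:6\n')
```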

I think that we have to change our steps in how to test fallbacks, or turn off applying updates before Firefox starts.
What I really wonder is why the heck we have all those failures now! The underlying code that handles the applying step has been in our tests for ages and never failed until recently.
There was a recent change (bug 1348609) which should have made us prefer the path "%appdata%/local/Mozilla/update/%hash%" and never use the other one. I'm not immediately certain whether a problem with that patch would be able to cause these symptoms, but it's the most recent relevant change. Plus that change wasn't uplifted, so it should only be affecting nightly. Still working on it, leaving needinfo in place.
It sounds like this is also affecting beta and ESR. I had a look through the pushlog and the only change to recently hit beta that's even vaguely related to update is bug 1350064, but that's add-on update, and it looks like it didn't even make it in for beta 7 if I'm reading the log correctly. Also it hasn't been uplifted to ESR at all. So I can't identify any app update code change that could be responsible here.
Flags: needinfo?(mhowell)
From talking with everyone testing here, it sounds like we think this is likely to be a test failure across several channels, and we should go ahead with the release as planned. mhowell and whimboo can keep investigating.
Comparing 2 tests for en-US b6 to b7:

green https://firefox-ui-tests.s3.amazonaws.com/6ba85c35-2ef0-4efd-ac41-f6c3f8ac937a/log_info.log

04:41:49     INFO -   'update_log': ['Performing a replace request',
04:41:49     INFO -                  'PATCH DIRECTORY C:\\Users\\mozauto\\AppData\\Local\\Mozilla\\updates\\5B85A4075EAE3983\\updates\\0',
04:41:49     INFO -                  'INSTALLATION DIRECTORY c:\\jenkins\\workspace\\ondemand_update\\build\\application.copy',
04:41:49     INFO -                  'WORKING DIRECTORY c:\\jenkins\\workspace\\ondemand_update\\build\\application.copy\\updated',
04:41:49     INFO -                  'Begin moving destDir (c:\\jenkins\\workspace\\ondemand_update\\build\\application.copy) to tmpDir (c:\\jenkins\\workspace\\ondemand_update\\build\\application.copy.bak)',
04:41:49     INFO -                  'rename_file: proceeding to rename the directory',
04:41:49     INFO -                  'Begin moving newDir (c:\\jenkins\\workspace\\ondemand_update\\build\\application.copy.bak/updated) to destDir (c:\\jenkins\\workspace\\ondemand_update\\build\\application.copy)',
04:41:49     INFO -                  'rename_file: proceeding to rename the directory',
04:41:49     INFO -                  'Now, remove the tmpDir',
04:41:49     INFO -                  'ensure_remove: failed to remove file: c:\\jenkins\\workspace\\ondemand_update\\build\\application.copy.bak/api-ms-win-crt-convert-l1-1-0.dll, rv: -1, err: 13',
04:41:49     INFO -                  'ensure_remove: failed to remove file: c:\\jenkins
......



vs failed: https://firefox-ui-tests.s3.amazonaws.com/b38b3bd5-6665-4dd1-b007-79b2de9196ef/log_info.log

00:59:39     INFO -   'update_log': ['Performing a staged update',
00:59:39     INFO -                  'PATCH DIRECTORY C:\\Users\\mozauto\\AppData\\Local\\Mozilla\\updates\\5B85A4075EAE3983\\updates\\0',
00:59:39     INFO -                  'INSTALLATION DIRECTORY c:\\jenkins\\workspace\\ondemand_update\\build\\application.copy',
00:59:39     INFO -                  'WORKING DIRECTORY c:\\jenkins\\workspace\\ondemand_update\\build\\application.copy\\updated',
00:59:39     INFO -                  'UPDATE TYPE partial',
00:59:39     INFO -                  'PREPARE PATCH xul.dll',
00:59:39     INFO -                  'PREPARE PATCH wow_helper.exe',
00:59:39     INFO -                  'PREPARE PATCH voucher.bin',

....


The latter log looks like it applies a partial MAR, but it probably shouldn't?
Those two logs are from different parts of the update procedure. The green run also starts off by applying a partial, because that's how the test works. But the failed run didn't get as far as the step that's logged in the green run's update log before failing out for still mysterious reasons.
Yeah, most likely this is the case here because we do not upgrade Firefox. So the last-update.log is not getting updated. I'm off for the rest of the week, so I cannot investigate anything until Monday.
I've been trying to repro this locally on beta, and could not. Tried both actually running the test and also performing the steps manually; no failure either way.

I do get a test failure on aurora, but I don't think it's for the same reason. What I'm seeing is that the partial MAR at
https://mozilla-nightly-updates.s3.amazonaws.com/mozilla-aurora/20170329004027/Firefox-mozilla-aurora-54.0a2-win32-en-US-20170328004003-20170329004027.partial.mar?versionId=8DuWJW3ppUzMQPjxtEuptdRO_RMmCDpP
doesn't apply cleanly. I get this in my update.log:

Performing a staged update
PATCH DIRECTORY C:\Users\Matt Howell\AppData\Local\Mozilla\updates\26E70F8ABC2D3A34\updates\0
INSTALLATION DIRECTORY C:\Users\Matt Howell\Downloads\firefox
WORKING DIRECTORY C:\Users\Matt Howell\Downloads\firefox\updated
UPDATE TYPE partial
PREPARE PATCH xul.dll
[...]
EXECUTE PATCH omni.ja
LoadSourceFile: destination file size 9799029 does not match expected size 9799042
LoadSourceFile failed
### execution failed

which means that the size of the old omni.ja wasn't the same as what the patch file said to expect. I verified myself that the size in the MAR is actually wrong, so the updater didn't parse it incorrectly; the MAR itself was somehow not generated correctly.
Note that the above does not break normal updates, because the update service immediately falls back to the complete MAR.
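The size mismatch can be spotted without running the updater at all by comparing the on-disk size of the source file against the size the partial patch expects; a sketch using the values from the log above (the install path is just the local test installation mentioned there):

```python
import os

# Size the partial MAR expects for omni.ja, taken from the update.log excerpt above.
EXPECTED_SOURCE_SIZE = 9799042

actual = os.path.getsize(r'C:\Users\Matt Howell\Downloads\firefox\omni.ja')
if actual != EXPECTED_SOURCE_SIZE:
    print('LoadSourceFile would fail: {} != {}'.format(actual, EXPECTED_SOURCE_SIZE))
```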
Some late-to-the-party suggestions - rerun the 52.0.2 tests that passed earlier to see if they fail now. This is complicated by the change to the websense rules.

For our convenience:
52.0.2 - https://treeherder.mozilla.org/#/jobs?repo=mozilla-release&revision=e81854d6ce91f3174774a50c9c5813c33b9aac58&filter-tier=1&filter-tier=2&filter-tier=3&filter-searchStr=update
53.0b7 - https://treeherder.mozilla.org/#/jobs?repo=mozilla-beta&revision=46d5fe92c82d36b922703faeb7dd3a17b5df55e1&exclusion_profile=false&filter-tier=1&filter-tier=2&filter-tier=3&filter-searchStr=update

(In reply to Henrik Skupin (:whimboo) from comment #13)
> All this affects only Windows but not Linux and OS X. Matt, do you know of
> any changes for the updater on Windows which might have caused those
> failures? It's not always happening but with a higher frequency compared to
> passing tests.

Are we using the Windows maintenance service? If the machines are reused for many tests, is it possible that the app and service are out of sync? This could lead to test-history-dependent results, like a change from pass to fail if a nightly test sneaks in between the 52.0.2 and 53.0b7/52esr tests.
The logs indicate that the maintenance service is being used; it's not really needed for these tests to work, but it's probably good to have it so that we more closely simulate a "real" update, which would generally go through the service.

I don't think this is the issue though; the service and the application don't really need to be in sync by specific version, just their code signing certificates need to match up, and those haven't changed since 43.
Comment 27 was a red herring; the same fallback mechanism that rescues normal updates also kicks in to save this test.
The failure I saw earlier appears to have happened because I forgot to update my tree to aurora before switching to running the test against an aurora build. That means I do not actually have a local repro at all. :(
It looks as if after the first update is applied and the browser is shut down, there isn't any logging from the new instance, either from Marionette or gecko. That's making it pretty much impossible to tell anything from the logs, because the first update staging appears to be going fine (as in, looks identical to a passing run).
[Tracking Requested - why for this release]: having our update tests routinely fail for every beta and esr build can't really go on for too long...
(In reply to Nick Thomas [:nthomas] from comment #29)
> Are we using the maintenance service of Windows ? If the machines are reused
> for many tests, is it possible that the app and service are out of sync  ?

I uninstalled the maintenance service while testing this last week on one of the machines. The problem still persisted afterward. So I agree that disabling it shouldn't be an option right now, also given that we would lose coverage.

I asked Florin to re-run updates for source builds from 53.0b1 to 53.0b4, as it looks like those never raised these issues.

What I can do later is attach some logs to the bug from a run with only the fallback update test running and finally failing. Maybe that helps a bit more than the logs on Treeherder, which also include the partial update tests.
Flags: needinfo?(florin.mezei)
(In reply to Henrik Skupin (:whimboo) from comment #38)
> I asked Florin to re-run updates for source builds from 53.0b1 to 53.0b4. As
> it looks like those never raised those issues.

Today I re-ran all update tests for 53.0b8 on the beta channel, using only source builds < 53.0b5. No failures were encountered whatsoever, on either Linux or Windows.
Flags: needinfo?(florin.mezei)
Interesting. Can you repeat now with b5 and b6 => b8? I would like to know if those are also working now, or if those are still affected. If the latter is the case we have to check what landed between b4 and b5. Last time we actually only checked the changelog between b6 and b7, but it looks like the failure already came in with b5 and was somehow not visible at that time.
Flags: needinfo?(florin.mezei)
(In reply to Henrik Skupin (:whimboo) from comment #40)
> Interesting. Can you repeat now with b5 and b6 => b8? I would like to know
> if those are also working now, or if those are still affected. If the latter
> is the case we have to check what landed between b4 and b5. Last time we
> actually only checked the changelog between b6 and b7, but as it looks like
> the failure came in already for b5 but was somewhat not visible at this time.

Did this just now, and got 5 failed jobs out of 44, all on Windows (not classified in Treeherder):
- Win 7, 32-bit: en-US-53.0b6, fr-53.0b5
- Win 7, 64-bit: es-MX-53.0b5
- Win 8, 32-bit: en-US-53.0b6
- Win 8, 64-bit: zh-CN-53.0b5

Treeherder run: https://treeherder.mozilla.org/#/jobs?repo=mozilla-beta&revision=0a36cc120d6e&group_state=expanded&filter-tier=3&filter-searchStr=Fxup-beta(
Flags: needinfo?(florin.mezei)
Ok, so would you mind giving us a link to the hg pushlog for all the changes between 53.0b4 and 53.0b5?
(In reply to Henrik Skupin (:whimboo) from comment #42)
> Ok, so mind giving us a link with the hg pushlog for all the changes between
> 53.0b4 and 53.0b5?

https://hg.mozilla.org/releases/mozilla-beta/pushloghtml?fromchange=484f49860df6&tochange=eb20f71a4f43
Nothing in this pushlog actually looks suspicious to me.

So I think we have to iterate step-wise. Therefore I want to re-ask the question from comment 19. Matt, can you please check this again and tell me if we should avoid using staged updates, at least in the case of fallback updates?
Flags: needinfo?(mhowell)
Re: comment 19
The updater doesn't consider it a problem when the directory it's going to try and stage into already exists; it just deletes the existing directory before it starts patching. See [0]. So I don't think that's the issue.

[0] https://dxr.mozilla.org/mozilla-central/rev/38894655c89e68bcd8f45d31a0d3005f2c2b53db/toolkit/mozapps/update/updater/updater.cpp#3366
Flags: needinfo?(mhowell)
Matt, not sure if you gave the correct link here. The code I'm referring to is the one at this stage:

1) partial update has been downloaded
2) partial update has been correctly applied 
3) update.status file got modified with `failed:6\n`
4) Firefox restarts

What happens in step 4? The partial has been successfully applied, and a new subdirectory containing the updated Firefox version is still around under the installation dir. So my questions:

1) Where and when is this subdir getting deleted?
2) Where is the complete patch getting downloaded? We don't see anything in the software update window. It's just there already. Maybe the download is really that fast?
Flags: needinfo?(mhowell)
(In reply to Henrik Skupin (:whimboo) from comment #46)
> 1) Where and when is this subdir getting deleted?

I was answering the theory put forward in comment 19, which ends with:
> As such the files are remaining in that subfolder. Now trying to apply the complete mar fails because there is still an update process active.

The updater binary is the only thing that ever works with the staging directory, and the first thing it does there is to either create that directory if it doesn't exist or empty it out if it does. That's the code at the link I posted. So the fact that there's still a "failed" copy in there is never a problem; it's never even really noticed.

> 2) Where is the complete patch getting downloaded? We don't see anything in
> the software update window. It's just there already. Maybe the download is
> really that fast?

It starts downloading when the initial window appears, even though you don't see a progress bar until after you click OK. But note that, because of the situation I talked about in comment 27, there's an exception to the usual flow right now just for aurora. Since it's a complete update that we're setting to failed in that case, and not a partial as it would usually be, nothing is downloaded initially (because it's the complete update that would be the fallback) until the about box is brought up again to start the manual check.
Flags: needinfo?(mhowell)
While running ondemand update tests on beta-cdntest for 53.0b9 [1], I encountered this issue again for 5 Windows jobs:
- on the first re-build => 1 job passed, the other 4 failed
- on the second re-build, where I changed the TEST_PACKAGES_URL to point to the tinderbox builds containing the fix from bug 1351335 ([2] and [3]) => all 4 tests passed

So it would seem like the fix from bug 1351335 has helped with these test failures.

[1] - https://treeherder.mozilla.org/#/jobs?repo=mozilla-beta&revision=3a2e5c6e4ed3&group_state=expanded&filter-tier=3&filter-searchStr=Fxup-beta-cdntest(
[2] - https://archive.mozilla.org/pub/firefox/tinderbox-builds/mozilla-beta-win32/1491229643/firefox-53.0.en-US.win32.test_packages.json
[3] - https://archive.mozilla.org/pub/firefox/tinderbox-builds/mozilla-beta-win64/1491229643/firefox-53.0.en-US.win64.test_packages.json
(In reply to Florin Mezei, QA (:FlorinMezei) from comment #48)
> - on the second re-build - I changed the TEST_PACKAGES_URL to point to the
> tinderbox builds containing the fix from bug 1351335 ([2] and [3]) => all 4
> tests passed
> 
> So it would seem like the fix from bug 1351335 has helped with these test
> failures.

That would be interesting because I don't really see a relationship here. So could you please retrigger those passing tests with the other test package a couple of times? Maybe 10? I would like to know the failure rate. Thanks.
(In reply to Henrik Skupin (:whimboo) from comment #49)
> That would be interesting because I don't really see a relationship here. So
> could you please retrigger those passing tests with the other test package a
> couple of times? Maybe 10? I would like to know the failure rate. Thanks.

I've retriggered all 4 jobs that passed with the modified TEST_PACKAGES_URL, but this time with the original value (that does not have the fix from bug 1351335), 3 times each (so a total of 12 jobs) - I got 11 failures for a failure rate of 92%.
(In reply to Florin Mezei, QA (:FlorinMezei) from comment #48)
> So it would seem like the fix from bug 1351335 has helped with these test
> failures.

After running the ondemand update tests on the beta channel for Firefox 53 beta 9, it seems that my theory above does not stand, as I still got a bunch of failures even with the modified TEST_PACKAGES_URL. Apparently that was just a very lucky coincidence on beta-cdntest, and Henrik was right to think it's not related to this issue.
(In reply to Henrik Skupin (:whimboo) from comment #1)
> I think that this is a side-effect from bug 1314550 now for nightly and
> aurora builds. Underlying reason should be a possible second nightly build
> on that channel within a day, and the delayed offering for partial updates
> due to slow processing of funsize tasks (bug 1284516).
> 
> There are two things to think about:
> 
> 1. We should make the failure message more general. Which means that we do
> not always talk about watershed but just that another update got offered. We
> can then add the version and buildid of the currently updated version to at
> least being able to more easily verify that situation.
> 
> 2. Bug 1285340 which covers a different expected target build id could be
> related here or even be the underlying issue. Given that we moved the check
> for further updates up by some lines we do it first now. See
> https://hg.mozilla.org/mozilla-central/rev/fa64968dc3a1. But I'm not fully
> sure about that theory because even when our tests run late, Balrog should
> offer us an update to the most recent version immediately. Maybe those two
> affected update tests were running at a time when the new updates were close
> to be ready, so we see an overlap.
> 
> I wonder if we could simply ignore further updates in case of a target
> version and buildid have been given and those match the ones for the updated
> builds. For beta and release builds it would mean that for testing
> watersheds we do not get the failure anymore, which shows us that other
> updates are available. But I doubt that this should be a problem, and we can
> trigger a second ondemand run for watersheds only. With that we would have
> green results even - failures are always strange.
> 
> Rail and Florin, what do you think?

Given that we hijacked this bug for the recent update test failures on beta, I'm going to clone the bug to get the above done separately.
Ok, so I did a lot of testing on staging today for the latest Firefox 53.0 beta 9 and closely observed every single update test job on the Windows machine. By doing that I was able to see a substantial difference between a passing and a failing update test:

Passing test:
=============

We download the partial update, which gets applied. Then we make it invalid by storing `failed:6\n` in update.status. After a restart Firefox displays the old software update dialog with the failure message. Clicking on `Next` closes the dialog and opens the about window, which is used to download the complete MAR file. A restart of Firefox results in a successful update.

Failing test:
=============

We download the partial update, which gets applied. Then we make it invalid by storing `failed:6\n` in update.status. After a restart Firefox also displays the old software update dialog with the failure message. Clicking on `Next` does NOT close the window but keeps it open with a restart button. Somehow we do not fail even though we cannot interact with the about window, and we restart Firefox. Strangely the update does not get applied this way.



So in the failing case the complete MAR file is downloaded so fast that it is already stored locally when we hit the `Next` button in the old software update dialog. See comment 17.

Also, when I check the results I can see that the failure ONLY happens when we originally serve a partial update, not a complete one. For beta 9, all source builds earlier than beta 6 are offered a complete update only, and that works fine.

When downloading an update via the old software update window, our tests modify the pref `app.update.altwindowtype` and set its value to the software update dialog's window type.

https://dxr.mozilla.org/mozilla-central/rev/b043233ec04f06768d59dcdfb9e928142280f3cc/testing/firefox-ui/harness/firefox_ui_harness/testcases.py#187

In line 217 we check if the update is already downloaded, which would be the case in the above situation, and we return to `download_and_apply_forced_update()` after resetting the preference. Finally the restart is triggered, still not via the restart button, due to issues with builds < Firefox 52 (see bug 1304656 comment 7).
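Simplified, the flow around that pref looks roughly like this (hypothetical helper and attribute names; the real implementation is the linked testcases.py around lines 187-217):

```python
def download_via_update_wizard(dialog, set_pref, clear_pref):
    # Route the download through the old software update dialog instead of
    # the about window by pointing the update service at its window type.
    set_pref('app.update.altwindowtype', dialog.window_type)
    try:
        if dialog.patch_already_downloaded:
            # The complete MAR finished downloading while the error page was
            # still showing, so there is no download progress to wait for.
            return
        dialog.wait_for_download_finished()
    finally:
        clear_pref('app.update.altwindowtype')
```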

Matt, do you see anything obviously wrong with those lines of code, or with using this preference?
Flags: needinfo?(mhowell)
Thanks for doing what must have been a really tedious analysis to get that info. Using it, I've reproduced _a_ bug locally, but I'm not certain if it's the same one that you saw. Here's the process I'm seeing:

1) Starting from 53.0b8, update via the about window and close Firefox instead of clicking the restart button. This results in a partial update being successfully staged.
2) Overwrite the update.status file with "failed: 6".
3) Start the browser. Observe the "Update Failed" dialog, but don't click OK yet. The complete update downloads and stages while this window is showing.
4) Click OK, but only _after_ the update is downloaded and staged.
5) The progress window that appears next is broken; it hangs on "Connecting to the update server..."
6) Open the About window and click the restart button that is already showing there.
7) Observe that the update has not been applied.

The reason the update hasn't been applied is that step 4 causes the update.status file to be overwritten with "null". This is clearly a bug.

Henrik, from what you observed, does it seem plausible that the download of the complete update in step 3 is running so fast that it's done before the OK button gets clicked? If so, that would seem to explain the failures.

Right now I don't see any reason to think that the altwindowtype pref is causing trouble; it wouldn't affect the about window, and I'm not setting it for my manual tests.
Flags: needinfo?(mhowell) → needinfo?(hskupin)
What I also noticed is that the fallback update tests for initial partial MARs pass if the complete update does not get downloaded that fast! When the software update dialog shows the progress bar, everything works fine. It only fails when the MAR is already in the cache.
(In reply to Matt Howell [:mhowell] from comment #55)
> Henrik, from what you observed, does it seem plausible that the download of
> the complete update in step 3 is running so fast that it's done before the
> OK button gets clicked? If so, that would seem to explain the failures.

Yes, that is exactly what I see on our Windows boxes when the test failure happens! The situation might not occur in the wild, but for our tests it's annoying. Knowing this problem, do you have an idea why it has regressed lately? I assume you have to do more investigation before you can give an answer.
Flags: needinfo?(hskupin)
I'm not certain why this would have regressed recently. I don't see any recent enough changes to the updater UI/frontend code, except things like eslint fixes that shouldn't make any difference. There could have been a change somewhere else that ended up altering the caching behavior we get so as to make the download much faster, or something like that.

If we're right about this being the cause, then I think I have a patch that should fix it, by skipping the progress page that's not expecting to find an update already staged and going straight to the restart prompt instead. I'll post that shortly.
Attachment #8855063 - Attachment is obsolete: true
Attachment #8855063 - Flags: review?(robert.strong.bugs)
Filing a new bug for the above patch, since it may turn out to be unrelated to this test failure.
The patch discussed above (bug 1353917) is currently on autoland, so it should be in tonight's nightly. Henrik, what needs to happen to be able to verify whether that patch makes a difference for this failure? I don't feel great about just pushing it to beta and ESR before we actually know it fixes anything.
Flags: needinfo?(hskupin)
If it's in the next nightly build on central, we will hopefully see a difference on Saturday (+2 days), because we need this nightly as the source build and the nightly after next as the target build.

On central we only run en-US builds for updates, which means only 4 updates per platform version. But on Aurora we have 5 locales, which means 20 updates per platform version. I would suggest we get it uplifted to aurora and re-check everything on Monday. If the watershed test failures have stopped on central and aurora, that should be a good indication. But as said earlier, I'm not sure if those are identical. Anyway, if all works fine on both channels over the weekend, I would feel comfortable getting it uplifted to beta. Do you also have specific unit tests which can check that?
Flags: needinfo?(hskupin)
We do have unit tests that cover this code. But of course those are just unit tests; they don't verify the entire fallback procedure.

On your suggestion, I'll request aurora uplift, and we can see what we've got on Monday.
I think something else we should do here is add a check that Firefox has really been upgraded after the restart. This would have given us a much better failure message and less confusion. I will try to get a check for this added today.
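Such a check could be as small as comparing the build information collected before and after the restart, using the same build_pre/build_post structure that already shows up in the logs (a sketch, not the final patch):

```python
def assert_upgrade_applied(build_pre, build_post, target_buildid=None):
    # The restart must leave us on a different (and, if given, the expected) build.
    assert build_post['buildid'] != build_pre['buildid'], (
        'Firefox has not been upgraded: still running build {} ({})'.format(
            build_pre['buildid'], build_pre['version']))
    if target_buildid:
        assert build_post['buildid'] == target_buildid, (
            'Updated to build {} but expected {}'.format(
                build_post['buildid'], target_buildid))
```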
Depends on: 1355009
(In reply to Henrik Skupin (:whimboo) from comment #53)
> Given that we high-jacked this bug for the recent update test failures on
> beta I'm going to clone the bug to get the above done separately.

Forgot to add that this is bug 1353717 now.
I had a look at Orange Factor for this bug and this is how it looks:

https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1316564&startday=2016-10-01&endday=2017-04-12&tree=all

So it started on November 5th and recently had a huge spike because those failures are visible on beta. The first failing test was en-US-2, which means the source build was from November 3rd.

Checking a couple of test failures for beta all leads to bug 1355818, which seems to show the exact same underlying issue: the staged update is not correctly applied during a restart of Firefox.

I think we should keep this bug open until the first beta update tests for Firefox 54.0. I strongly believe that we can close it by then and track everything via bug 1355818.
The changelog is:

https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=3e73fd638e687a4d7f46613586e5156b8e2af846&tochange=ade8d4a63e57560410de106450f37b50ed71cca5

So there is bug 1314550, which is exactly the change that added this assertion check. But at that time we didn't have issues with applying the update during a restart.

There is also bug 1304656, but it got backed out and never landed again so far. 

Otherwise I don't see anything obvious. The application updater changes are mostly eslint fixes and look fine.
Summary: Intermittent test_direct_update.py TestDirectUpdate.test_update, test_fallback_update.py TestFallbackUpdate.test_update | AssertionError: Additional update found due to watershed release 51.0a2 → [tier-3] Intermittent test_direct_update.py TestDirectUpdate.test_update, test_fallback_update.py TestFallbackUpdate.test_update | AssertionError: Additional update found due to watershed release 51.0a2
The real underlying failure (bug 1355818) was hidden by this wrongly placed assertion. So the patch on bug 1355009 fixed that.
Assignee: nobody → hskupin
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Do we need to do something for ESR52 here still?
Flags: needinfo?(hskupin)
(In reply to Ryan VanderMeulen [:RyanVM] from comment #75)
> Do we need to do something for ESR52 here still?

No, the patch which fixed that already landed on esr52 via bug 1355009.
Flags: needinfo?(hskupin)
Target Milestone: --- → mozilla55
What I don't understand is why this failure still remains on esr52, and especially why only on Windows XP. This specific failure message doesn't exist anymore on that branch. So where are those boxes getting this failure from?

When I check the following failure:

https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-esr52&job_id=100093785

I can see in the mozmill-ci job that the test package refers to:

https://archive.mozilla.org/pub/firefox/tinderbox-builds/mozilla-esr52-win32/1495049059/firefox-52.1.2.en-US.win32.test_packages.json

Which then downloads:

https://archive.mozilla.org/pub/firefox/tinderbox-builds/mozilla-esr52-win32/1495049059/firefox-52.1.2.en-US.win32.common.tests.zip

> % grep -r "watershed" tests/firefox-ui/
> tests/firefox-ui//harness/firefox_ui_harness/testcases.py:            # Bug 604364 - We do not support watershed releases yet.
> tests/firefox-ui//harness/firefox_ui_harness/testcases.py:                             'Additional update found due to watershed release {}'.format(

What? Why does this package contain old data? 

I will file a new bug to get this covered.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Depends on: 1366173
We shouldn't see this problem with the 52.2.0 release anymore. But keep in mind that it will be present if another security fix release is shipped from the 52.1.x relbranch.
Status: REOPENED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
We also landed it on the 52.1.x relbranch. So it should be completely fixed even if there is another security release coming up for esr52.