Closed Bug 1303834 Opened 8 years ago Closed 7 years ago

[tier-3] Intermittent test_fallback_update.py TestFallbackUpdate.test_update (partial MAR) | IOError: Process killed because the connection to Marionette server is lost. Check gecko.log for errors (Reason: Timed out waiting for port 2828!)

Categories

(Toolkit :: Application Update, defect)

Type: defect
Priority: Not set
Severity: critical

Tracking


Status: RESOLVED FIXED
Target Milestone: mozilla55
Tracking Status
firefox50 --- wontfix
firefox51 --- wontfix
firefox52 --- wontfix
firefox-esr52 --- fixed
firefox53 --- wontfix
firefox54 --- fixed
firefox55 --- fixed

People

(Reporter: intermittent-bug-filer, Assigned: molly)

References

Details

(Keywords: intermittent-failure, regression, Whiteboard: [stockwell unknown])

Attachments

(2 files)

Filed by: hskupin [at] gmail.com

https://treeherder.mozilla.org/logviewer.html#?job_id=5036232&repo=mozilla-central

https://firefox-ui-tests.s3.amazonaws.com/1cf613c4-045e-4935-a0c7-649120b5b75f/log_info.log

This is a follow-up to bug 1293404, since it looks like the issue we initially saw has not been fully fixed yet.
From the logs:

10:37:16     INFO -  *** AUS:UI gFinishedPage:not elevationRequired
10:37:16     INFO -  (firefox:5620): GLib-GObject-CRITICAL **: g_type_add_interface_static: assertion 'g_type_parent (interface_type) == G_TYPE_INTERFACE' failed
10:37:16     INFO -  (firefox:5620): GLib-GObject-CRITICAL **: g_type_add_interface_static: assertion 'g_type_parent (interface_type) == G_TYPE_INTERFACE' failed
10:37:16     INFO -  (firefox:5620): GLib-GObject-CRITICAL **: g_type_add_interface_static: assertion 'g_type_parent (interface_type) == G_TYPE_INTERFACE' failed
10:37:21     INFO -  (firefox:5620): GLib-GObject-CRITICAL **: g_object_ref: assertion 'object->ref_count > 0' failed
10:37:21     INFO -  (firefox:5620): GLib-GObject-CRITICAL **: g_object_unref: assertion 'object->ref_count > 0' failed
10:37:21     INFO -  (firefox:5620): GLib-GObject-CRITICAL **: g_object_ref: assertion 'object->ref_count > 0' failed
10:37:21     INFO -  (firefox:5620): GLib-GObject-CRITICAL **: g_object_unref: assertion 'object->ref_count > 0' failed
10:37:22     INFO -  *** UTM:SVC TimerManager:registerTimer - id: xpi-signature-verification
10:37:22     INFO -  ATTENTION: default value of option force_s3tc_enable overridden by environment.
10:37:22     INFO -  *** AUS:SVC Creating UpdateService
10:37:22     INFO -  *** AUS:SVC readStatusFile - status: applying, path: /home/mozauto/jenkins/workspace/mozilla-central_update/build/application.copy/updates/0/update.status
10:37:23     INFO -  1474306643425	Marionette	ERROR	Error on starting server: [Exception... "Component returned failure code: 0x804b0036 (NS_ERROR_SOCKET_ADDRESS_IN_USE) [nsIServerSocket.initSpecialConnection]"  nsresult: "0x804b0036 (NS_ERROR_SOCKET_ADDRESS_IN_USE)"  location: "JS frame :: chrome://marionette/content/server.js :: MarionetteServer.prototype.start :: line 85"  data: no]
10:37:23     INFO -  [Exception... "Component returned failure code: 0x804b0036 (NS_ERROR_SOCKET_ADDRESS_IN_USE) [nsIServerSocket.initSpecialConnection]"  nsresult: "0x804b0036 (NS_ERROR_SOCKET_ADDRESS_IN_USE)"  location: "JS frame :: chrome://marionette/content/server.js :: MarionetteServer.prototype.start :: line 85"  data: no]
10:37:23     INFO -  MarionetteServer.prototype.start@chrome://marionette/content/server.js:85:19
10:37:23     INFO -  MarionetteComponent.prototype.init@resource://gre/components/marionettecomponent.js:218:5
10:37:23     INFO -  MarionetteComponent.prototype.observe@resource://gre/components/marionettecomponent.js:142:7
10:37:23     INFO -  *** UTM:SVC TimerManager:registerTimer - id: browser-cleanup-thumbnails
10:37:29     INFO -  *** AUS:SVC gCanCheckForUpdates - able to check for updates
The readStatusFile call in the above snippet should be reporting that the status is "applied", not "applying". That's what's shown in the logs for passing runs. That means that the failure is caused by the browser process deciding that the updater process has exited before it actually has.

So the question is how that's happening. I don't think waitpid() is throwing any errors because the bug 1293404 patch added a log message for that case, and I don't see that message in the logs (it would have been in the above snippet). I also don't think waitpid() is reporting some state change in the updater other than exiting normally, because there also should have been a log message on that branch even in the original bug 1272614 patch. That doesn't really leave me any ideas.

whimboo, these jobs don't have TaskCluster ID's (I guess they don't run in TaskCluster?) so I can't get loaners from them. I'd really like to point strace at one of these runs so I can see what waitpid() is doing. Do I need to use the "manual" loaner request process?
Flags: needinfo?(hskupin)
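For reference, the wait loop being discussed here polls waitpid() in non-blocking mode. A minimal standalone sketch of that check (not the actual nsUpdateDriver.cpp code; the function name is only borrowed from the log messages further down) illustrates the three outcomes being ruled out:

#include <sys/types.h>
#include <sys/wait.h>
#include <cstdio>

// Sketch only: poll the updater child process without blocking.
static bool ProcessHasTerminated(pid_t aUpdaterPid) {
  int status = 0;
  pid_t rv = waitpid(aUpdaterPid, &status, WNOHANG);
  if (rv == 0) {
    // Child still running; the caller waits a second and polls again.
    return false;
  }
  if (rv == -1) {
    // Error case; the bug 1293404 patch logs this, and that message is
    // absent from the failing logs, so this branch is presumably not taken.
    std::perror("waitpid");
    return true;
  }
  // rv == aUpdaterPid: the child reported a state change, normally an exit.
  // A non-exit state change would also have been logged per bug 1272614.
  return true;
}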
(In reply to Matt Howell [:mhowell] from comment #2)
> whimboo, these jobs don't have TaskCluster ID's (I guess they don't run in
> TaskCluster?) so I can't get loaners from them. I'd really like to point
> strace at one of these runs so I can see what waitpid() is doing. Do I need
> to use the "manual" loaner request process?

Sorry, but I had to revert mozmill-ci from using TaskCluster yesterday because some incompatible changes landed at the end of last week. Given that lots of changes are happening right now in terms of task definitions, I'm not able to keep up with all of that, and I don't want to risk more breakage. So we are running all of our tests via our own slaves again.

That shouldn't actually be bad! The benefit is that we always have them available, and connections won't drop as they do with AWS. I will see if I can prepare a machine for you to use. Then you can do whatever you want.
Flags: needinfo?(hskupin)
So the machine I wanted to set up doesn't show this particular problem. Therefore I prepared one of our slave nodes from the production environment. I would be happy if you could limit your activity to the local environment and the folders listed below, to avoid global pollution of the VM. Thanks.

Steps for running the tests:
1. Connect to the Mozilla VPN
2. Connect via ssh to mm-ub-1404-64-4.qa.scl3.mozilla.com (ping me on IRC to get the user/pass)
3. Run `screen -x`
4. Run the last commands from bash history

It always reproduces this failure for me with a source build from Sep 19th.
I've not been able to reach that server through the VPN. I suspected a VPN configuration issue at first, but I've tried from a Mac, Windows, and a Linux system, so I don't think I'm doing the same thing wrong on all three.

On IRC, whimboo suggested I could use an old TaskCluster run but with a newer binary, and I was able to do that. But I'm not any less confused now, unfortunately. strace doesn't show any waitpid calls for the updater that return anything other than 0 (meaning "process still running") so I have no better idea how that loop is getting broken out of prematurely. I may have to push some patches to oak that add some more logging for... something.
https://hg.mozilla.org/projects/oak/rev/e86e0aeb2dc0ca4c5a3b1281b6be856751655839
Added some more logging for investigating bug 1303834. Don't land this anywhere else.
https://hg.mozilla.org/projects/oak/rev/664d9b090718
Merge m-c to oak, so that we have a second build to test bug 1303834 against.
I requested new nightly builds for both changesets. They should appear soon.
FYI, currently we restart Firefox via Services.startup.quit(). But with bug 1304656 we now want to change it to use the restart button. I'm not sure whether Firefox would behave differently between those two methods.
Matt, you can get a Nightly here for testing:
https://archive.mozilla.org/pub/firefox/nightly/2016/09/2016-09-21-23-36-38-oak/
Flags: needinfo?(mhowell)
Our tradition of logs making no sense whatsoever and leaving me utterly confused remains unbroken:

[...]
1928328960[7f9766bba300]: ProcessHasTerminated: Checking state of updater process
1928328960[7f9766bba300]: ProcessHasTerminated: Updater process is still running; waiting 1 second before trying again
1928328960[7f9766bba300]: WaitForProcess: process still running, dispatching myself
*** UTM:SVC TimerManager:registerTimer - id: xpi-signature-verification
ATTENTION: default value of option force_s3tc_enable overridden by environment.
*** AUS:SVC Creating UpdateService
*** AUS:SVC readStatusFile - status: applying, path: /tmp/tmpDYP2gp.application.copy/updates/0/update.status
[...]

So, my newly added logging tells us that WaitForProcess() correctly gets a false return from ProcessHasTerminated() and attempts to dispatch itself so it can check again, but then what actually happens is that UpdateDone() runs. I do not have the slightest idea how that could be possible. The dispatching works correctly a bunch of times before this happens, so it's not like it's just always broken.
Flags: needinfo?(mhowell)
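To make the sequence described above concrete, here is a small standalone model of the pattern in question, with a plain task queue standing in for the watcher thread (this is not the Gecko nsThread code; the printed strings merely echo the log messages above):

#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <cstdio>
#include <deque>
#include <functional>

int main() {
  pid_t updater = fork();
  if (updater == 0) {        // stand-in for the updater child process
    sleep(3);
    _exit(0);
  }

  std::deque<std::function<void()>> watcherQueue;  // models the watcher thread
  std::function<void()> waitForProcess;

  waitForProcess = [&] {
    int status = 0;
    if (waitpid(updater, &status, WNOHANG) == 0) {
      std::puts("WaitForProcess: process still running, dispatching myself");
      sleep(1);
      watcherQueue.push_back(waitForProcess);  // re-dispatch, as in the logs
    } else {
      // Only this branch should lead to UpdateDone()/RefreshUpdateStatus(),
      // which re-reads update.status ("applied" in passing runs).
      std::puts("UpdateDone: updater has exited");
    }
  };

  watcherQueue.push_back(waitForProcess);
  while (!watcherQueue.empty()) {
    auto task = watcherQueue.front();
    watcherQueue.pop_front();
    task();
  }
  return 0;
}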
Could it be that RefreshUpdateStatus() is getting called from somewhere else, independently of the code which currently runs in nsUpdateDriver.cpp? By the way, where can I find the implementation of RefreshUpdateStatus()?
Ok, so it's here:
https://dxr.mozilla.org/mozilla-central/rev/f0e6cc6360213ba21fd98c887b55fce5c680df68/toolkit/mozapps/update/nsUpdateService.js#3111

Something which might work is to use the --js-debugger option, which lets you set breakpoints. Maybe it's really related to how we instruct Firefox to restart the browser (Services.startup.quit())?
After seeing the above run, I tried again with nsThread logging enabled, and I've got this:

[... lots of copies of the next six lines here ...]
-1988102400[7fbc8e9941e0]: WaitForProcess: process still running, dispatching myself
[Unnamed thread 0x7fbc8e9941e0]: D/nsThread THRD(7fbc7bfbb120) Dispatch [0 0]
[Unnamed thread 0x7fbc8e9941e0]: D/nsThread THRD(7fbc7bfbb120) ProcessNextEvent [0 0]
[Unnamed thread 0x7fbc8e9941e0]: D/nsThread THRD(7fbc7bfbb120) running [7fbc78e62130]
-1988102400[7fbc8e9941e0]: ProcessHasTerminated: Checking state of updater process
-1988102400[7fbc8e9941e0]: ProcessHasTerminated: Updater process is still running; waiting 1 second before trying again
[Main Thread]: D/nsThread THRD(7fbcbd95cd10) running [7fbc7903ac20]
[Main Thread]: D/nsThread THRD(7fbc9ca65aa0) sync shutdown
[Main Thread]: D/nsThread THRD(7fbcbd95cd10) ProcessNextEvent [1 0]
[GMPThread]: D/nsThread THRD(7fbc9ca65aa0) running [7fbc7903ac20]
[GMPThread]: D/nsThread THRD(7fbc9ca65aa0) ProcessNextEvent [0 0]
[GMPThread]: D/nsThread THRD(7fbcbd95cd10) Dispatch [0 0]
[Main Thread]: D/nsThread THRD(7fbcbd95cd10) running [7fbc7903ac20]
[Main Thread]: D/nsThread THRD(7fbcbd95cf80) sync shutdown
[Main Thread]: D/nsThread THRD(7fbcbd95cd10) ProcessNextEvent [1 0]
[Link Monitor]: D/nsThread THRD(7fbcbd95cf80) ProcessNextEvent [0 0]
[Link Monitor]: D/nsThread THRD(7fbcbd95cf80) running [7fbc7903ac20]
[Link Monitor]: D/nsThread THRD(7fbcbd95cf80) ProcessNextEvent [0 0]
[Link Monitor]: D/nsThread THRD(7fbcbd95cd10) Dispatch [0 0]
[Main Thread]: D/nsThread THRD(7fbcbd95cd10) running [7fbc7903ac20]
[Main Thread]: D/nsThread THRD(7fbcbd95cd10) ProcessNextEvent [0 0]
[Timer]: D/nsThread THRD(7fbcbd95ceb0) ProcessNextEvent [0 0]
[Timer]: D/nsThread THRD(7fbcbd95ceb0) ProcessNextEvent [1 0]
[Main Thread]: D/nsThread THRD(7fbcbd95ceb0) sync shutdown
[Main Thread]: D/nsThread THRD(7fbcbd95cd10) ProcessNextEvent [1 0]
[Timer]: D/nsThread THRD(7fbcbd95ceb0) running [7fbc7903ac20]
[Timer]: D/nsThread THRD(7fbcbd95ceb0) ProcessNextEvent [0 0]
[Timer]: D/nsThread THRD(7fbcbd95cd10) Dispatch [0 0]
[Main Thread]: D/nsThread THRD(7fbcbd95cd10) running [7fbc7903ac20]
[Main Thread]: D/nsThread THRD(7fbcbd95cd10) ProcessNextEvent [0 0]
[Main Thread]: D/nsThread THRD(7fbcbd95cd10) ProcessNextEvent [0 0]
[Main Thread]: D/nsThread THRD(7fbc7bfbb120) sync shutdown
[Main Thread]: D/nsThread THRD(7fbcbd95cd10) ProcessNextEvent [1 0]
-1988102400[7fbc8e9941e0]: WaitForProcess: process still running, dispatching myself
[Unnamed thread 0x7fbc8e9941e0]: D/nsThread THRD(7fbc7bfbb120) Dispatch [0 0]
[Unnamed thread 0x7fbc8e9941e0]: D/nsThread THRD(7fbc7bfbb120) ProcessNextEvent [0 0]
[Unnamed thread 0x7fbc8e9941e0]: D/nsThread THRD(7fbc7bfbb120) running [7fbc7903ac20]
[Unnamed thread 0x7fbc8e9941e0]: D/nsThread THRD(7fbc7bfbb120) ProcessNextEvent [0 0]
[Unnamed thread 0x7fbc8e9941e0]: D/nsThread THRD(7fbcbd95cd10) Dispatch [0 0]
[Main Thread]: D/nsThread THRD(7fbcbd95cd10) running [7fbc7903ac20]
[Main Thread]: D/nsThread THRD(7fbcbd95cd10) ProcessNextEvent [0 0]
[Main Thread]: D/nsThread THRD(7fbcbd95cd10) ProcessNextEvent [0 0]
[Main Thread]: D/nsThread THRD(7f830ef5cd10) Dispatch [0 0]
[Main Thread]: D/nsThread THRD(7f830ef5cd10) Dispatch [0 0]
[Main Thread]: D/nsThread THRD(7f830ef5ceb0) Dispatch [0 0]
[... test proceeds to finish failing ...]

This doesn't make any sense to me; it looks like everything is fine until suddenly the absolute wrong function gets dispatched to the nsUpdateProcessor thread, but then somehow also ends up running on the main thread after that? I am not any less confused.
(In reply to Henrik Skupin (:whimboo) from comment #13)
> Could it be that RefreshUpdateStatus() is getting called from somewhere
> else? Independently from the code which currently runs in
> nsUpdateDriver.cpp?

There are no other calls to RefreshUpdateStatus() anywhere except for the one in nsUpdateProcessor::UpdateDone(). And UpdateDone() isn't invoked from anywhere else either.

(In reply to Henrik Skupin (:whimboo) from comment #14)
> Something which might work is to use the --js-debugger option, which let you
> set breakpoints.

Hmm. Not sure how that would help, since I know there's only the one call site.

> Maybe it's really related in how we instruct Firefox to
> restart the browser (Services.startup.quit())?

Don't know how that would work, but I suppose stranger things have happened. At what points in the test does that get called?
I'd really like to get an rr recording, but as far as I know rr wouldn't work in any of our various test runners, and I'm still without a local repro.
(In reply to Matt Howell [:mhowell] from comment #17)
> I'd really like to get an rr recording, but as far as I know rr wouldn't
> work in any of our various test runners, and I'm still without a local repro.

Dustin, do you know something about rr on our testers? Is it something we could get working there, even if it's on a one-click loaner for now?

If that is not possible we might have to request temporary VPN-QA access for Matt, so that he could connect to one of our slave nodes.
Flags: needinfo?(dustin)
To my understanding, rr does not work on EC2 instances.
Flags: needinfo?(dustin)
(In reply to Matt Howell [:mhowell] from comment #16)
> > Maybe it's really related in how we instruct Firefox to
> > restart the browser (Services.startup.quit())?
> 
> Don't know how that would work, but I suppose stranger things have happened.
> At what points in the test does that get called?

Each time Firefox has to be restarted. We haven't been able to click on the restart button yet, but could implement this soon given that bug 1298800 is fixed now.
(In reply to Henrik Skupin (:whimboo) from comment #21)
> Each time when Firefox has to be restarted. We weren't able yet to click on
> the restart button, but could implement this soon given that bug 1298800 is
> fixed now.

In that case, I don't think that's related, because the code that's acting up here is on the path that determines when the restart button should appear; the trouble happens before it's time for a restart, not during or after.
Some thinking here... so far we have only been able to see that behavior in Linux VMs or on Taskcluster, which uses Docker images in virtual machines (AWS). Maybe it's some kind of trouble with virtualization?
So something strange is happening here. Since I updated all the machines, including the Linux ones, with the latest OS and Java updates, the failure as laid out in this bug is gone. At least for mozilla-aurora:

https://treeherder.mozilla.org/#/jobs?repo=mozilla-aurora&filter-searchStr=fxup&filter-tier=1&filter-tier=2&filter-tier=3

So I'm not sure if some process was still around which caused such behavior.

I do not have answers to the following two topics yet:

1. mozilla-central: Our tests are currently busted due to a change in how we use virtualenv with mozharness. Once that has been fixed (should be with the next set of nightly builds), we can re-verify.

2. I don't know if those problems would still exist in Taskcluster. We currently don't use those workers but our own ones in mozmill-ci.

So let's wait a couple of days and revisit it then. FYI, I will be on PTO starting Friday this week through Thursday next week.
Ok, so this is still an ongoing problem with our update tests. I still see failures on mozilla-aurora, and not just a few.

Matt, would you have the time to look into this again? It would be great if we could make some progress.
After discovering bug 1309556 yesterday, I get the feeling that this remaining issue could somehow be related. That would mean we are trying to re-connect to a still-shutting-down instance of Firefox, which then gets killed. The failure summary doesn't really match, but the fix for the race might still help.
Depends on: 1309556
(In reply to Henrik Skupin (:whimboo) from comment #29)
> After discovering bug 1309556 yesterday, I get the feeling this this
> remaining issue here could somehow be related. Means we are trying to
> re-connect to the still shutting down instance of Firefox, which then gets
> killed. The failure summary doesn't really match, but the fix for the race
> might still help.

Are you suggesting that, with the patch for 1309556 landed (2 days ago), we should just wait and see if it recurs?
Flags: needinfo?(hskupin)
As of now we do not have results for mozilla-central. The reason was bug 1306421, which busted all of our tests on Linux and Windows. Starting tomorrow we will have update test results again.
Flags: needinfo?(hskupin)
The failure is still happening on mozilla-central, so the patch mentioned above didn't contribute to fixing this bug. One other option we should try now is to actually restart Firefox via the restart button to apply the update. But I'm not sure whether that would change anything.
Depends on: 1304656
All attempts to see this fixed via in_app restarts have failed. So I assume this really has to do with how we respawn the process after an update. Matt, this failure can still be seen multiple times each day. I wonder whether another attempt should be made to investigate this.
Flags: needinfo?(mhowell)
In all honesty, I have run completely out of ideas for how to investigate this, much less solve it. It needs somebody with experience working on weird low-level Linux-specific issues. I think I left enough information in comments here for someone else to pick this up, and if not I'm happy to explain the situation, but I'm afraid it's gotten beyond me.
Flags: needinfo?(mhowell)
(In reply to Matt Howell [:mhowell] from comment #37)
> In all honestly, I ran completely out of ideas for how to investigate this,
> much less to solve it. It needs somebody with experience working on weird
> low-level Linux-specific issues. I think I left enough information in
> comments here for someone else to pick this up, and if not I'm happy to
> explain the situation, but I'm afraid it's gotten beyond me.

The only person I can think of off the top of my head for this might be Karl. Karl, I'm not sure if you have the time to help us fix this process-related issue when updating Firefox. If you can, we would really appreciate it. Thank you in advance.
Flags: needinfo?(karlt)
I don't see Fxup-auroratest on
https://treeherder.mozilla.org/#/jobs?repo=oak&filter-tier=1&filter-tier=2&filter-tier=3&exclusion_profile=false
What do I need to do to see the logs?

Can you verify that changes for bug 1272614 triggered this by reverting those
changes on oak?

(In reply to Matt Howell [:mhowell] from comment #12)
> [...]
> 1928328960[7f9766bba300]: ProcessHasTerminated: Checking state of updater
> process
> 1928328960[7f9766bba300]: ProcessHasTerminated: Updater process is still
> running; waiting 1 second before trying again
> 1928328960[7f9766bba300]: WaitForProcess: process still running, dispatching
> myself
> *** UTM:SVC TimerManager:registerTimer - id: xpi-signature-verification
> ATTENTION: default value of option force_s3tc_enable overridden by
> environment.
> *** AUS:SVC Creating UpdateService
> *** AUS:SVC readStatusFile - status: applying, path:
> /tmp/tmpDYP2gp.application.copy/updates/0/update.status
> [...]
> 
> So, my newly added logging tells us that WaitForProcess() correctly gets a
> false return from ProcessHasTerminated() and attempts to dispatch itself so
> it can check again, but then what actually happens is that UpdateDone()
> runs.

How do you know that UpdateDone() happens?

It sounds like some logging may be getting truncated?

Perhaps this may happen if the process exits abnormally for some reason and so
buffers are not flushed.

Can you add logging to UpdateDone() to confirm this theory?

If that confirms the theory, then I'd be inclined to use fprintf(stderr,) to
be sure MOZ_LOG() isn't writing to some other kind of stream, but I guess stderr
is not necessarily always flushed either.
Blocks: 1272614
Flags: needinfo?(karlt) → needinfo?(mhowell)
Keywords: regression
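A minimal sketch of the extra logging Karl suggests (the helper name and its call site are assumptions; the idea is simply to write straight to stderr and flush immediately, so the line survives even if the process exits before buffered MOZ_LOG output is written):

#include <cstdio>
#include <unistd.h>

// Sketch only: log directly to stderr, bypassing any buffered logging stream.
// Intended to be called at the top of UpdateDone() to confirm whether
// UpdateDone really runs in the failing case.
static void LogUpdateDoneEntered() {
  std::fprintf(stderr, "[pid %d] UpdateDone: entered\n",
               static_cast<int>(getpid()));
  std::fflush(stderr);
}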
Matt, when you extend the logging, could you also add some timing information? It would be good to know how long it actually takes before we leave the loop that waits for the update to be applied. Maybe in some cases it takes longer than 60s?
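A hedged sketch of that timing (the class and its call sites are assumptions): record when the wait loop starts and log the elapsed seconds on each poll, so the logs would show whether leaving the loop ever takes longer than the 60 seconds mentioned above.

#include <chrono>
#include <cstdio>

// Sketch only: measure how long the update wait loop has been running.
class WaitLoopTimer {
 public:
  WaitLoopTimer() : mStart(std::chrono::steady_clock::now()) {}

  // Call on every poll, e.g. next to the existing "still running" log line.
  void LogElapsed(const char* aWhat) const {
    const auto elapsed = std::chrono::duration_cast<std::chrono::seconds>(
        std::chrono::steady_clock::now() - mStart).count();
    std::fprintf(stderr, "%s: %lld s since the wait loop started\n",
                 aWhat, static_cast<long long>(elapsed));
    std::fflush(stderr);
  }

 private:
  std::chrono::steady_clock::time_point mStart;
};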
Just want to mention that this is still our most frequently occurring test failure for update tests on Linux! This is for mozilla-central and mozilla-aurora.
I promise I haven't forgotten about this! I do plan to try Karl's suggestions, but I've had other stuff going on recently.
Is this something we need to worry about for 50?
Flags: needinfo?(hskupin)
(In reply to Ryan VanderMeulen [:RyanVM] from comment #45)
> Is this something we need to worry about for 50?

I cannot speak to update test results for betas and releases. Florin can give you an answer to this question.
Flags: needinfo?(hskupin) → needinfo?(florin.mezei)
Beta and Release have not been affected by this so far (I've also just run the update tests for Fx 50 on release-cdntest and did not run into this).
Flags: needinfo?(florin.mezei)
Interesting. I wonder if there is special code that only gets run when updating non-release builds.
Not much status in the bug in the last 2 weeks; checking in as this bug is quite frequent. Is there anything we are stuck on?
We haven't prioritized this, because of the intermittent nature and comments 45-47. In addition, I know that mhowell has not been able to reproduce this well locally. Happy to take patches on this (if people have time) until mhowell is cleared of his current priorities.
I have some interesting news. On bug 1322199 I'm currently working on an update for the proxy settings of our boxes. It looks like with those applied, or some other unknown changes, the problem is gone:

https://treeherder.allizom.org/#/jobs?repo=mozilla-central&exclusion_profile=false&filter-searchStr=firefox%20ui%20update%20linux&filter-tier=1&filter-tier=2&filter-tier=3&selectedJob=5569697

We should observe that for a couple of days. Maybe it will return.
During the last week this test failure only showed up for the 32-bit Linux machines; the 64-bit ones work perfectly fine. I'm not sure what could have changed that. Maybe it was the proxy updates I did on bug 1322199, and the related restart of the machines? But I restarted them all, so why do the 32-bit machines still show this problem?
Should we just leave this open in perpetuity in case this recurs more severely? You've got the most insight into this right now, and no one has a cause or solution.
Flags: needinfo?(mhowell) → needinfo?(hskupin)
As long as it isn't fixed we cannot close this bug.
Flags: needinfo?(hskupin)
FYI, we haven't had any update test results for Linux in the last 14 days because we build Nightlies on Linux via TC now, and funsize jobs didn't send out the notifications we were listening for.

It looks like the failure rate is again somewhat high.
:whimboo, can you look into fixing this or help find someone who can?
Flags: needinfo?(hskupin)
Whiteboard: [stockwell needswork]
Joel, please see comment 44 for the last update. Someone would have to try to test what karlt mentioned; it's not something I can do. Matt helped out here, but he seems to have other priorities at the moment.
Flags: needinfo?(hskupin)
This picked up in frequency in the last week. Assuming it stays at this rate, we would like to see this fixed or disabled within 2 weeks. :mhowell, I assume you own this test case and can help find the right people to work on this?
Flags: needinfo?(mhowell)
Joel, please take into account that those tests are reported at the Tier-3 level. Sadly there is no way to distinguish that from Tier-1/2.
Oh, they shouldn't be starred and in my intermittent dashboard then. I see what you mean; can we annotate this as a tier-3 test so it doesn't affect OrangeFactor, or ask the sheriffs not to annotate this?
Flags: needinfo?(mhowell)
I annotate those failures to get a feel for our intermittent failure rate for update tests; I have no idea how to track those failures otherwise. Maybe it would be wise to bump those jobs to Tier-2, especially because they are so important for release work. We should discuss this outside of this bug, to be honest.
The failures are primarily on beta/aurora. :whimboo, is this something you can figure out and fix?
Flags: needinfo?(hskupin)
(In reply to Joel Maher ( :jmaher) from comment #79)
> the failures are primarily beta/aurora, :whimboo is this something you can
> figure out and fix?

It happens more often on Aurora and Beta because we are running update tests for different locales on those branches. As such we roughly get 4 x locale_count failures on Linux 32/64 per day.

I cannot work on this at this time while I have to finish up WebDriver P1 bugs, sorry.
It looks like the recent failures stopped on March 14th and newer failures are on mozilla-central, but the rate has slowed down considerably.

It would be nice to resolve any easy wins here when the WebDriver work slows down.
There are also other issues with updates lately, e.g. bug 1285340. So this might be the reason why we had a temporary decrease in failures.
Flags: needinfo?(hskupin)
This keeps showing up on my radar, and almost all the issues are on mozilla-aurora; possibly we just need to live with this?
While this has a lot of failures, it is primarily on mozilla-beta, followed by esr and aurora. :whimboo, are you aware of this high failure rate?
Flags: needinfo?(hskupin)
Joel, as mentioned at least twice on this bug, this is a product issue and needs a developer to sort it out and get it fixed. It's not something I have expertise in. Matt would be the most likely person to work on it, but he might have other priorities, so release-drivers should figure that out. It's not something I can do.
Flags: needinfo?(hskupin) → needinfo?(lhenry)
I'm going to be spending a bit of time on this to implement karlt's suggestions from comment 39, since that's the last actionable idea I have available. Starting with pushing a backout of bug 1272614 to oak to see what happens.
https://hg.mozilla.org/projects/oak/rev/fd7e1ee7273113ff5254ad467cd98b9f407a5278
Bug 1303834 - Backed out changeset d5fb267d0946 to see if this failure is affected
Thanks, Matt. Does that mean we should test updates on oak, or will this eventually be merged to mozilla-central?
I'd rather not merge this to central until we know if it makes a difference (and preferably understand why if so), so getting the tests run on oak would be ideal.
Matt, I don't see that we build nightlies at all on oak. The problem could be that those are done via TC now and the job might not have been set up.

https://treeherder.mozilla.org/#/jobs?repo=oak&filter-searchStr=nightly

Could you check that and get those builds created? Florin and I are happy to test that but we would need usable builds. Thanks.
Flags: needinfo?(lhenry) → needinfo?(mhowell)
Flags: needinfo?(mhowell)
(In reply to Robert Strong [:rstrong] (use needinfo to contact me) from comment #99)
> oak nightly builds
> https://archive.mozilla.org/pub/firefox/nightly/latest-oak/

Those are not Linux builds. It's only Mac and Windows as my TH link also shows.
Flags: needinfo?(robert.strong.bugs)
Filed bug 1353819
Flags: needinfo?(robert.strong.bugs)
Can we disable these tests or mark them as tier-3? We are at 3+ weeks of a very high failure rate.
Flags: needinfo?(hskupin)
Those tests are Tier-3 level tests.
Flags: needinfo?(hskupin)
How did we get 174 OrangeFactor stars if they are tier-3? I would not expect to get any OrangeFactor data for tier-3 on our integration and release branches.
(In reply to Joel Maher ( :jmaher) from comment #107)
> how did we get 174 orangefactor stars if they are tier-3, I would not expect
> to get any orangefactor data for tier-3 on our integration and release
> branches.

We are going in circles. :) Please see comment 74. If this is something I should not do, let me know and I will save my time and stop starring the failures. With that in mind, I wouldn't see a reason to further check for failures on integration branches, which means we might get hit by failures on the merge from aurora to beta by surprise.
Adding a [tier-3] tag to the subject, which should help reduce confusion. Tier-3 is great for greening something up, but once you star something it gets into OrangeFactor and other failure dashboards, and we are constantly reminded of it.
Summary: Intermittent test_fallback_update.py TestFallbackUpdate.test_update | IOError: Process killed because the connection to Marionette server is lost. Check gecko.log for errors (Reason: Timed out waiting for port 2828!) → [tier-3] Intermittent test_fallback_update.py TestFallbackUpdate.test_update | IOError: Process killed because the connection to Marionette server is lost. Check gecko.log for errors (Reason: Timed out waiting for port 2828!)
Whiteboard: [stockwell needswork] → [stockwell unknown]
(In reply to Matt Howell [:mhowell] from comment #97)
> I'd rather not merge this to central until we know if it makes a difference
> (and preferably understand why if so), so getting the tests run on oak would
> be ideal.

Looks like the nightly builds are available now. For example those could be used:

source: https://treeherder.mozilla.org/#/jobs?repo=oak&filter-searchStr=nightly&selectedJob=90480988
target: https://treeherder.mozilla.org/#/jobs?repo=oak&filter-searchStr=nightly&selectedJob=90235931

Florin, could you trigger such an update test on oak, and let it repeat a dozen times if it's not failing? Thanks.
Flags: needinfo?(florin.mezei)
(In reply to Henrik Skupin (:whimboo) from comment #110)
> Looks like the nightly builds are available now. For example those could be
> used:
> 
> source:
> https://treeherder.mozilla.org/#/jobs?repo=oak&filter-
> searchStr=nightly&selectedJob=90480988
> target:
> https://treeherder.mozilla.org/#/jobs?repo=oak&filter-
> searchStr=nightly&selectedJob=90235931
> 
> Florin, could you trigger such an update test on oak, and let it repeat a
> dozen of times if it's not failing? Thanks.


I can't seem to figure out the parameters for running this - http://mm-ci-production.qa.scl3.mozilla.com:8080/job/ondemand_update/62742/

Henrik can you advise?
Flags: needinfo?(florin.mezei) → needinfo?(hskupin)
Checking the logs, I can see that there is a problem downloading the mozharness archive. The reason is that no archives are getting created on oak: https://hg.mozilla.org/integration/oak/archive/

So basically this is not solvable in mozmill-ci as long as we use the archiver script from releng to fetch the mozharness archive.

What we would have to do is trigger the automated tests manually on the machines. It's similar to what I explained to you recently when we had issues with beta builds. After downloading and extracting the oak nightly, the tests should work fine. Would you mind doing that? If yes, I would appreciate it. It would also make sure that you are able to do it in case I'm not around.
Flags: needinfo?(hskupin)
So with help from Henrik I did manage to test this manually on a Linux machine. However, the tests failed because no update was offered for: https://aus5.mozilla.org/update/6/Firefox/53.0a1/20170410200459/Linux_x86-gcc3/en-US/nightly-oak/Linux%203.13.0-106-generic%20(GTK%203.10.8%2Clibpulse%204.0.0)/NA/default/default/update.xml?force=1.

Given that there are more recent builds here [1], I was expecting to get an update.

[1] - https://archive.mozilla.org/pub/firefox/nightly/2017/04/2017-04-11-13-53-21-oak/
Ben, any idea why no updates are offered for the above URL?
Flags: needinfo?(bhearsum)
(In reply to Florin Mezei, QA (:FlorinMezei) from comment #115)
> Ben, any idea why no updates are offered for the above URL?

I had locked nightlies to an earlier revision that didn't have Linux ones while testing something. I just reverted that - should be fixed now.
Flags: needinfo?(bhearsum)
Thanks Ben!

I've re-tested on the same machine, and this time an update was indeed served, but I've hit bug 1260383 - JavascriptException: TypeError: ums.activeUpdate is null, for both Direct and Fallback updates (tried twice), with this sort of error showing in the logs:

1492013159680   Marionette      TRACE   6 -> [0,962,"executeScript",{"scriptTimeout":null,"newSandbox":true,"args":["16"],"filename":"windows.py","script":"\n              Components.utils.import(\"resource://gre/modules/Services.jsm\");\n\n              let win = Services.wm.getOuterWindowWithId(Number(arguments[0]));\n              return win.document.readyState == 'complete';\n            ","sandbox":"default","line":157}]
1492013159685   Marionette      TRACE   6 <- [1,962,null,{"value":true}]
ERROR: Error verifying signature.
(In reply to Florin Mezei, QA (:FlorinMezei) from comment #117)
> ERROR: Error verifying signature.

This error comes from:
https://dxr.mozilla.org/mozilla-central/rev/f40e24f40b4c4556944c762d4764eace261297f5/modules/libmar/verify/mar_verify.c#453

Looks like the downloaded mar files could not be verified. Ben, I'm not sure if you need the output from the updater log to investigate and fix this. Florin could most likely provide it tomorrow.
Flags: needinfo?(bhearsum)
(In reply to Henrik Skupin (:whimboo) from comment #118)
> (In reply to Florin Mezei, QA (:FlorinMezei) from comment #117)
> > ERROR: Error verifying signature.
> 
> This error comes from:
> https://dxr.mozilla.org/mozilla-central/rev/
> f40e24f40b4c4556944c762d4764eace261297f5/modules/libmar/verify/mar_verify.
> c#453
> 
> Looks like the downloaded mar files could not be verified. Ben, not sure if
> you need the output from the updater log to get this investigated and fixed.
> Florin most likely could provide this tomorrow.

This is probably happening because we changed the branding on oak at one point. If your starting build is https://hg.mozilla.org/projects/oak/rev/b5d2520dc1ddcce8a6a02f823226f91eaa461683 or later, everything should be fine now.
Flags: needinfo?(bhearsum)
Florin, can you please have another look with more recent oak nightlies? Thanks.
Flags: needinfo?(florin.mezei)
(In reply to Henrik Skupin (:whimboo) from comment #124)
> Florin, can you please have another look for more recent oak nightlies?
> Thanks.

I've tried this today on http://mm-ci-production.qa.scl3.mozilla.com:8080/computer/mm-ub-1404-32-3/ but encountered the following failure (build used was https://archive.mozilla.org/pub/firefox/nightly/2017/04/2017-04-16-11-02-36-oak/):

TEST-UNEXPECTED-ERROR | test_direct_update.py TestDirectUpdate.test_update | IOError: Process killed because the connection to Marionette server is lost. Check gecko.log for errors (Reason: Connection timed out after 360.0s)
Traceback (most recent call last):
  File "/home/mozauto/jenkins/workspace/ondemand_update/build/venv/local/lib/python2.7/site-packages/marionette_harness/marionette_test/testcases.py", line 166, in run
    testMethod()
  File "/home/mozauto/jenkins/workspace/ondemand_update/build/tests/firefox-ui/tests/update/direct/test_direct_update.py", line 20, in test_update
    self.download_and_apply_available_update(force_fallback=False)
  File "/home/mozauto/jenkins/workspace/ondemand_update/build/venv/local/lib/python2.7/site-packages/firefox_ui_harness/testcases.py", line 288, in download_and_apply_available_update
    self.restart()
  File "/home/mozauto/jenkins/workspace/ondemand_update/build/venv/local/lib/python2.7/site-packages/firefox_ui_harness/testcases.py", line 353, in restart
    super(UpdateTestCase, self).restart(*args, **kwargs)
  File "/home/mozauto/jenkins/workspace/ondemand_update/build/venv/local/lib/python2.7/site-packages/firefox_puppeteer/mixins.py", line 71, in restart
    self.marionette.restart(in_app=True)
  File "/home/mozauto/jenkins/workspace/ondemand_update/build/venv/local/lib/python2.7/site-packages/marionette_driver/decorators.py", line 23, in _
    return func(*args, **kwargs)
  File "/home/mozauto/jenkins/workspace/ondemand_update/build/venv/local/lib/python2.7/site-packages/marionette_driver/marionette.py", line 1222, in restart
    self._request_in_app_shutdown("eRestart")
  File "/home/mozauto/jenkins/workspace/ondemand_update/build/venv/local/lib/python2.7/site-packages/marionette_driver/marionette.py", line 1156, in _request_in_app_shutdown
    self._send_message("quitApplication", {"flags": list(flags)})
  File "/home/mozauto/jenkins/workspace/ondemand_update/build/venv/local/lib/python2.7/site-packages/marionette_driver/decorators.py", line 28, in _
    m._handle_socket_failure()
  File "/home/mozauto/jenkins/workspace/ondemand_update/build/venv/local/lib/python2.7/site-packages/marionette_driver/decorators.py", line 23, in _
    return func(*args, **kwargs)
  File "/home/mozauto/jenkins/workspace/ondemand_update/build/venv/local/lib/python2.7/site-packages/marionette_driver/marionette.py", line 726, in _send_message
    msg = self.client.request(name, params)
  File "/home/mozauto/jenkins/workspace/ondemand_update/build/venv/local/lib/python2.7/site-packages/marionette_driver/transport.py", line 284, in request
    return self.receive()
  File "/home/mozauto/jenkins/workspace/ondemand_update/build/venv/local/lib/python2.7/site-packages/marionette_driver/transport.py", line 211, in receive
    raise socket.timeout("Connection timed out after {}s".format(self.socket_timeout))

I tried some other machines but got another failure (so probably I missed something on those):
TEST-UNEXPECTED-ERROR | test_fallback_update.py TestFallbackUpdate.test_update | InvalidArgumentException: Unrecognised timeout: page load
stacktrace:
        WebDriverError@chrome://marionette/content/error.js:211:5
        InvalidArgumentError@chrome://marionette/content/error.js:301:5
        fromJSON@chrome://marionette/content/session.js:70:17
        GeckoDriver.prototype.setTimeouts@chrome://marionette/content/driver.js:1658:19
        execute/req<@chrome://marionette/content/server.js:510:22
        TaskImpl_run@resource://gre/modules/Task.jsm:319:42
        TaskImpl@resource://gre/modules/Task.jsm:277:3
        asyncFunction@resource://gre/modules/Task.jsm:252:14
        Task_spawn@resource://gre/modules/Task.jsm:166:12
        execute@chrome://marionette/content/server.js:500:15
        onPacket@chrome://marionette/content/server.js:471:7
        _onJSONObjectReady/<@chrome://marionette/content/server.js -> resource://devtools/shared/transport/transport.js:483:11
        exports.makeInfallible/<@resource://gre/modules/commonjs/toolkit/loader.js -> resource://devtools/shared/ThreadSafeDevToolsUtils.js:101:14
        exports.makeInfallible/<@resource://gre/modules/commonjs/toolkit/loader.js -> resource://devtools/shared/ThreadSafeDevToolsUtils.js:101:14
Traceback (most recent call last):
  File "/home/mozauto/jenkins/workspace/ondemand_update/build/venv/local/lib/python2.7/site-packages/marionette_harness/marionette_test/testcases.py", line 147, in run
    self.setUp()
  File "/home/mozauto/jenkins/workspace/ondemand_update/build/tests/firefox-ui/tests/update/fallback/test_fallback_update.py", line 11, in setUp
    UpdateTestCase.setUp(self, is_fallback=True)
  File "/home/mozauto/jenkins/workspace/ondemand_update/build/venv/local/lib/python2.7/site-packages/firefox_ui_harness/testcases.py", line 50, in setUp
    super(UpdateTestCase, self).setUp()
  File "/home/mozauto/jenkins/workspace/ondemand_update/build/venv/local/lib/python2.7/site-packages/firefox_puppeteer/mixins.py", line 77, in setUp
    super(PuppeteerMixin, self).setUp(*args, **kwargs)
  File "/home/mozauto/jenkins/workspace/ondemand_update/build/venv/local/lib/python2.7/site-packages/marionette_harness/marionette_test/testcases.py", line 468, in setUp
    super(MarionetteTestCase, self).setUp()
  File "/home/mozauto/jenkins/workspace/ondemand_update/build/venv/local/lib/python2.7/site-packages/marionette_harness/marionette_test/testcases.py", line 261, in setUp
    self.marionette.timeout.reset()
  File "/home/mozauto/jenkins/workspace/ondemand_update/build/venv/local/lib/python2.7/site-packages/marionette_driver/timeout.py", line 97, in reset
    self.page_load = DEFAULT_PAGE_LOAD_TIMEOUT
  File "/home/mozauto/jenkins/workspace/ondemand_update/build/venv/local/lib/python2.7/site-packages/marionette_driver/timeout.py", line 74, in page_load
    self._set("page load", sec)
  File "/home/mozauto/jenkins/workspace/ondemand_update/build/venv/local/lib/python2.7/site-packages/marionette_driver/timeout.py", line 33, in _set
    self._marionette._send_message("setTimeouts", {name: ms})
  File "/home/mozauto/jenkins/workspace/ondemand_update/build/venv/local/lib/python2.7/site-packages/marionette_driver/decorators.py", line 23, in _
    return func(*args, **kwargs)
  File "/home/mozauto/jenkins/workspace/ondemand_update/build/venv/local/lib/python2.7/site-packages/marionette_driver/marionette.py", line 729, in _send_message
    self._handle_error(err)
  File "/home/mozauto/jenkins/workspace/ondemand_update/build/venv/local/lib/python2.7/site-packages/marionette_driver/marionette.py", line 762, in _handle_error
    raise errors.lookup(error)(message, stacktrace=stacktrace)
Flags: needinfo?(florin.mezei)
The issues Florin pointed out above were caused by running the tests with the ondemand_update job, which certainly will not work for nightly builds. So after mentioning that to him, we ran the update tests for oak together on a Linux machine.

The results we got are not that promising but at least we are getting closer...

As noticed with the fallback updates for an oak build from April 17th, we do not offer any partial update because those do not seem to get built. Only complete mar patches are available:

https://aus5.mozilla.org/update/6/Firefox/55.0a1/20170417110320/Linux_x86-gcc3/en-US/nightly-oak/Linux%203.13.0-106-generic%20(GTK%203.10.8%2Clibpulse%204.0.0)/NA/default/default/update.xml?force=1

Running the update tests with those builds, the issue in this bug didn't surface at all. So I also had a look at various update tests for recent beta and RC candidate builds and noticed an interesting fact:

https://treeherder.mozilla.org/#/jobs?repo=mozilla-release&revision=d345b657d381ade5195f1521313ac651618f54a2&filter-searchStr=firefox%20ui%20linux%20update&filter-tier=1&filter-tier=2&filter-tier=3

All the failures happen with a fallback update for an initially served partial update! If the initial patch is a complete one (which is the case from 53.0b9 downwards), no connection issues with Marionette happen after the final restart!

So I believe this is strongly related to partial patches and fallback updates.

Simon and Ben, I wonder if there is a way to enable funsize partial patch generation for the oak branch. At least for Linux where we have to investigate this problem.
Severity: normal → critical
Flags: needinfo?(sfraser)
Flags: needinfo?(bhearsum)
Summary: [tier-3] Intermittent test_fallback_update.py TestFallbackUpdate.test_update | IOError: Process killed because the connection to Marionette server is lost. Check gecko.log for errors (Reason: Timed out waiting for port 2828!) → [tier-3] Intermittent test_fallback_update.py TestFallbackUpdate.test_update (partial MAR) | IOError: Process killed because the connection to Marionette server is lost. Check gecko.log for errors (Reason: Timed out waiting for port 2828!)
Have submitted https://github.com/mozilla-releng/funsize/pull/58 to add the routes for oak, awaiting review.
Flags: needinfo?(sfraser)
I don't have time to look at this anytime soon; I'm swamped with Dawn work.
Flags: needinfo?(bhearsum)
(In reply to Simon Fraser [:sfraser] ⌚️GMT from comment #128)
> Have submitted https://github.com/mozilla-releng/funsize/pull/58 to add the
> routes for oak, awaiting review.

This is now in place. The workers for this are a bit overloaded at times, so if it's possible to not trigger nightlies at similar times to the existing ones that would help a great deal.

Simon.
I've re-tested this with https://archive.mozilla.org/pub/firefox/nightly/2017/04/2017-04-20-11-02-08-oak/. 

I tested 6 times: it passed 2 times and failed 4 times with: IOError: Process killed because the connection to Marionette server is lost. Check gecko.log for errors (Reason: Timed out waiting for connection on localhost:2828!)

I'm also attaching two logs below: one for a passed test and one for a failed test. It seems to me from the logs that we fetched the partial.mar.
Thank you, Florin. Matt, it looks like reverting that one change didn't have any effect. Could this maybe also be related to what we saw today on bug 1358402 and monkey-patched to make it work?
Flags: needinfo?(mhowell)
Hmm. I'm not sure. What we had in bug 1358402 was the Marionette server just never trying to start, right? Here it is trying, but can't because of the NS_ERROR_SOCKET_ADDRESS_IN_USE.

Also, that error doesn't happen here until the second restart, the one to finish applying the fallback complete update, not the one during which we overwrite the status file.
Flags: needinfo?(mhowell)
(In reply to Matt Howell [:mhowell] from comment #136)
> Also, that error doesn't happen here until the second restart, the one to
> finish applying the fallback complete update, not the one during which we
> overwrite the status file.

Oh, that's correct. Sorry.

Matt, given that the attempt from comment 93 didn't work, could you add the suggestions from Karl so that we can see whether we can get some more information? Thanks.
Flags: needinfo?(mhowell)
(In reply to OrangeFactor Robot from comment #140)
> 105 failures in 183 pushes (0.574 failures/push) were associated with this
> bug yesterday.   
> 
> Repository breakdown:
> * mozilla-beta: 91

Due to the number of tests we have to run on beta (including all the locales), this is becoming crazy now. Matt, we would really appreciate it if you could find some time to implement the suggestions from Karl. Thanks.
(In reply to Karl Tomlinson (back Apr 26 :karlt) from comment #39)
> Can you verify that changes for bug 1272614 triggered this by reverting those
> changes on oak?

We have tried this, and it didn't help.

> How do you know that UpdateDone() happens?

Because that's what triggers the AUS:SVC lines that appear next.

> It sounds like some logging may be getting truncated?
> 
> Perhaps this may happen if the process exits abnormally for some reason and
> so buffers are not flushed.

The log described here agrees with the one in comment 15, so that would have to mean both of those logs (from different runs) got truncated in exactly the same place. And saying these are getting "truncated" doesn't seem to make sense, because the lines that appear to be missing would be in the middle of the logging we have, not at the end.

> Can you add logging to UpdateDone() to confirm this theory?
> 
> If that confirms the theory, then I'd be inclined to use fprintf(stderr,) to
> be sure MOZ_LOG() isn't writing to some other kind of stream, but I guess
> stderr is not necessarily always flushed either.

I could push a patch to oak that adds this logging, but in light of the above I don't think it would tell us anything we do not already know.
Flags: needinfo?(mhowell)
Similar to bug 1355818, this bug seems to show up much more often on beta than on beta-cdntest. In fact, for the past 3 builds, there has been no failed job on the beta-cdntest channel on Linux, but multiple failures on the beta channel (e.g. for 54b4 and 54b5).

Results for 54b5:
- beta-cdntest [1] - 0 failed jobs
- beta [2] - 8 jobs failed (some locales + en-US) - not re-run as we've decided not to do that anymore because it takes a lot of time and effort

[1] - https://treeherder.mozilla.org/#/jobs?repo=mozilla-beta&revision=06bf49fb5795&group_state=expanded&filter-tier=3&filter-searchStr=Fxup-beta-cdntest(
[2] - https://treeherder.mozilla.org/#/jobs?repo=mozilla-beta&revision=06bf49fb5795&group_state=expanded&filter-tier=3&filter-searchStr=Fxup-beta(
Following up on the findings in comment 145 (and the same as for bug 1355818), on the same day I also ran tests for the dot release 53.0.2 and ESR 52.1.1. The results of these tests seem to confirm the findings on beta: the more we move towards the official channels, the more failures we see. Oddly enough, I actually got zero failures of this kind on the localtest and cdntest channels, while the official channels were quite a bit more tricky (basically the same thing I saw for Beta). See the detailed results below:

1. Results for 53.0.2:
   a) release-localtest [1] - 0 failures (tests were 100% green)
   b) release-cdntest [2] - 0 failures (tests were 100% green)
   c) release [3] - 10 jobs failed - passed after multiple re-runs (71 in total)

2. Results for ESR 52.1.1:
   a) esr-localtest [4] - 0 failures (tests were 100% green)
   b) esr-cdntest [5] - 0 failures (tests were 100% green)
   c) esr [6] - 7 jobs failed - passed after multiple re-runs (40 in total)

[1] - https://treeherder.mozilla.org/#/jobs?repo=mozilla-release&revision=f87a819106bd&group_state=expanded&filter-tier=3&filter-searchStr=Fxup-release-localtest(
[2] - https://treeherder.mozilla.org/#/jobs?repo=mozilla-release&revision=f87a819106bd&group_state=expanded&filter-tier=3&filter-searchStr=Fxup-release-cdntest(
[3] - https://treeherder.mozilla.org/#/jobs?repo=mozilla-release&revision=f87a819106bd&group_state=expanded&filter-tier=3&filter-searchStr=Fxup-release(

[4] - https://treeherder.mozilla.org/#/jobs?repo=mozilla-esr52&revision=120111e65bc4&group_state=expanded&filter-tier=3&filter-searchStr=Fxup-esr-localtest(
[5] - https://treeherder.mozilla.org/#/jobs?repo=mozilla-esr52&revision=120111e65bc4&group_state=expanded&filter-tier=3&filter-searchStr=Fxup-esr-cdntest(
[6] - https://treeherder.mozilla.org/#/jobs?repo=mozilla-esr52&revision=120111e65bc4&group_state=expanded&filter-tier=3&filter-searchStr=Fxup-esr(
I haven't tested different channels around the same time on CI machines so far. Maybe that would be a good idea to do, Florin; that way we could see if we can exclude time-related failures. Could you do that? You would only have to add `--update-channel %name` as an option to the `firefox-ui-update` command.
Flags: needinfo?(florin.mezei)
(In reply to Henrik Skupin (:whimboo) from comment #149)
> I haven't tested different channels around the same time on CI machines so
> far. Maybe that would be a good idea to do, Florin. That way we could see if
> we can exclude time related failures. Florin, could you do that? You would
> only have to add `--update-channel %name` as option to the
> `firefox-ui-update` command.

I'll do this after we publish 54b6 to beta - so I should have some results tomorrow.
Flags: needinfo?(florin.mezei)
Flags: needinfo?(florin.mezei)
I've run 9 jobs today on http://mm-ci-production.qa.scl3.mozilla.com:8080/computer/mm-ub-1404-32-3/, for the update 53.0 -> 53.0.2 - 3 updates on release-localtest, 3 on release-cdntest, and 3 on release. All jobs passed. 

I'm leaving the needinfo on so I can try this again on another day.
Good news! There were no more such failures on Sunday, which means all 4 partial updates tested on Linux 32 and 64 are passing. That's something we haven't had for a very long time. So I assume that the patch on bug 1355818 also fixed this issue, though I'm not sure how it correlates. Anyway, I will keep an eye on it over the next few days.
Depends on: 1355818
The same result today: not a single failure on the Linux machines! That's something we haven't had in months! So I call this bug fixed by the patch on bug 1355818. Thanks, Matt!
Assignee: nobody → mhowell
Status: NEW → RESOLVED
Closed: 7 years ago
Flags: needinfo?(florin.mezei)
Resolution: --- → FIXED
Target Milestone: --- → mozilla55
This wasn't an issue in our tests but in the application updater.
Component: Firefox UI Tests → Application Update
Product: Testing → Toolkit
QA Contact: hskupin
Version: Version 3 → unspecified