Closed Bug 1261598 Opened 8 years ago Closed 7 years ago

Intermittent Automation Error: Received unexpected exception while running application | after error: [Errno 111] Connection refused

Categories

(Testing :: Marionette Client and Harness, defect)

Priority: Not set
Severity: normal

Tracking

(firefox57 fixed, firefox58 fixed)

Status: RESOLVED FIXED
Target Milestone: mozilla58

People

(Reporter: aryx, Assigned: whimboo)

References

Details

(Keywords: intermittent-failure, Whiteboard: [stockwell fixed:other])

Attachments

(1 file)

https://treeherder.mozilla.org/logviewer.html#?job_id=25030028&repo=mozilla-inbound

00:20:46     INFO -  MochitestServer : launching [u'/home/worker/workspace/build/tests/bin/xpcshell', '-g', '/home/worker/workspace/build/application/firefox', '-v', '170', '-f', '/home/worker/workspace/build/tests/bin/components/httpd.js', '-e', "const _PROFILE_PATH = '/tmp/tmpYwE2t1.mozrunner'; const _SERVER_PORT = '8888'; const _SERVER_ADDR = '127.0.0.1'; const _TEST_PREFIX = undefined; const _DISPLAY_RESULTS = false;", '-f', '/home/worker/workspace/build/tests/mochitest/server.js']
00:20:46     INFO -  runtests.py | Server pid: 1145
00:20:46     INFO -  runtests.py | Websocket server pid: 1148
00:20:46     INFO -  runtests.py | SSL tunnel pid: 1152
00:20:47     INFO -  runtests.py | Running tests: start.
00:20:47     INFO -  runtests.py | Application pid: 1173
00:20:47     INFO -  TEST-INFO | started process Main app process
00:20:47     INFO -  ### XPCOM_MEM_BLOAT_LOG defined -- logging bloat/leaks to /tmp/tmpYwE2t1.mozrunner/runtests_leaks.log
00:20:58     INFO -  ++DOCSHELL 0x7f254ede3800 == 1 [pid = 1173] [id = 1]
00:20:58     INFO -  ++DOMWINDOW == 1 (0x7f254ede4000) [pid = 1173] [serial = 1] [outer = (nil)]
00:20:58     INFO -  [1173] WARNING: No inner window available!: file /home/worker/workspace/build/src/dom/base/nsGlobalWindow.cpp, line 9773
00:20:58     INFO -  ++DOMWINDOW == 2 (0x7f254ede5000) [pid = 1173] [serial = 2] [outer = 0x7f254ede4000]
00:21:48     INFO -  Traceback (most recent call last):
00:21:48     INFO -    File "/home/worker/workspace/build/tests/mochitest/runtests.py", line 2357, in doTests
00:21:48     INFO -      marionette_args=marionette_args,
00:21:48     INFO -    File "/home/worker/workspace/build/tests/mochitest/runtests.py", line 1984, in runApp
00:21:48     INFO -      self.marionette.start_session(timeout=port_timeout)
00:21:48     INFO -    File "/home/worker/workspace/build/venv/local/lib/python2.7/site-packages/marionette_driver/marionette.py", line 1166, in start_session
00:21:48     INFO -      self.protocol, _ = self.client.connect()
00:21:48     INFO -    File "/home/worker/workspace/build/venv/local/lib/python2.7/site-packages/marionette_driver/transport.py", line 226, in connect
00:21:48     INFO -      self.sock.connect((self.addr, self.port))
00:21:48     INFO -    File "/usr/lib/python2.7/socket.py", line 224, in meth
00:21:49     INFO -      return getattr(self._sock,name)(*args)
00:21:49     INFO -  error: [Errno 111] Connection refused
00:21:49     INFO -  0 ERROR Automation Error: Received unexpected exception while running application
Depends on: 1293093
There's been a significant spike in frequency today, and higher-than-normal frequency for this bug over the last couple of days. It looks like many of the recent failures are on linux64-asan, in the gpu or mochitest-gl suites.

:whimboo -- Do you know what's going on?
Flags: needinfo?(hskupin)
One pattern I see is that this is mochitest-only, right? And every time it happens, we run some tests and then the Mochitest server seems to start a new Firefox process:

https://treeherder.mozilla.org/logviewer.html#?repo=autoland&job_id=74110423&lineNumber=1840

I don't know those suites well enough to give a clear answer, but maybe this is a shutdown/startup issue? It would be great to see a test job with gecko logging turned on (if possible), if the failure rate stays that high.
Flags: needinfo?(hskupin)
See Also: → 1318951
:gbrown is this something you could look at in the next week or two?
Flags: needinfo?(gbrown)
Note: this pattern changed around Feb 1st, and our rate of hitting this bug increased.
Isn't that bug 1318951, which Geoff added a day ago under "See Also"?
I am not clear whether that bug is really mis-starred instances of this bug or vice versa. I am under the impression that this bug is different from the other one and needs to be fixed by someone on our team.
Will see what I can do...
Assignee: nobody → gbrown
Flags: needinfo?(gbrown)
(In reply to Henrik Skupin (:whimboo) from comment #11)
> I don't know those suites well enough to give a clear answer, but maybe this
> is a shutdown/startup issue? It would be great to see a test job with gecko
> logging turned on (if possible), if the failure rate stays that high.

I can reproduce on try fairly easily. How do I turn on gecko logging?
Flags: needinfo?(hskupin)
(In reply to Geoff Brown [:gbrown] from comment #18)
> (In reply to Henrik Skupin (:whimboo) from comment #11)
> > I don't know those suites well enough to give a clear answer, but maybe
> > this is a shutdown/startup issue? It would be great to see a test job with
> > gecko logging turned on (if possible), if the failure rate stays that high.
> 
> I can reproduce on try fairly easily. How do I turn on gecko logging?

With the Marionette harness itself it's just "--gecko-log", but I don't know how the other test harnesses do that. Maybe Joel knows?
Flags: needinfo?(hskupin) → needinfo?(jmaher)
I have no idea; gbrown had asked me before asking in the bug.
Flags: needinfo?(jmaher)
Ok, so I had a look, and there is a "--console" option that can be passed when starting Firefox. If the test harness you are checking here isn't setting it, please hack it into the mozrunner run_command call.
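For illustration, a minimal sketch of what that could look like, assuming a harness that builds the Firefox command line as a Python list before handing it to mozrunner (the function and variable names are hypothetical):

    def build_app_command(binary, profile_dir, extra_args=None):
        cmd = [binary, "-marionette", "-foreground", "-profile", profile_dir]
        # Route Gecko's output to the console so it ends up in the test log.
        cmd.append("--console")
        if extra_args:
            cmd.extend(extra_args)
        return cmd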
Thanks :whimboo. 

https://treeherder.mozilla.org/#/jobs?repo=try&revision=97becc65975e0567282ab2fb941971ea0cf7132d has --console and reproduces this failure. For example:

https://public-artifacts.taskcluster.net/AxKCpVtaRFGg6Suwe2anEQ/0/public/logs/live_backing.log

[task 2017-02-10T23:38:04.684283Z] 23:38:04     INFO - Application command: /home/worker/workspace/build/application/firefox/firefox --console -marionette -foreground -profile /tmp/tmp1gV_zP.mozrunner
[task 2017-02-10T23:38:04.715424Z] 23:38:04     INFO - runtests.py | Application pid: 1128
[task 2017-02-10T23:38:04.716011Z] 23:38:04     INFO - TEST-INFO | started process Main app process
[task 2017-02-10T23:40:04.754794Z] 23:40:04     INFO -  Traceback (most recent call last):
[task 2017-02-10T23:40:04.755610Z] 23:40:04     INFO -    File "/home/worker/workspace/build/tests/mochitest/runtests.py", line 2417, in doTests
[task 2017-02-10T23:40:04.756719Z] 23:40:04     INFO -      marionette_args=marionette_args,
[task 2017-02-10T23:40:04.758099Z] 23:40:04     INFO -    File "/home/worker/workspace/build/tests/mochitest/runtests.py", line 2020, in runApp
[task 2017-02-10T23:40:04.759219Z] 23:40:04     INFO -      self.marionette.start_session(timeout=port_timeout)
[task 2017-02-10T23:40:04.760575Z] 23:40:04     INFO -    File "/home/worker/workspace/build/venv/local/lib/python2.7/site-packages/marionette_driver/decorators.py", line 28, in _
[task 2017-02-10T23:40:04.762437Z] 23:40:04     INFO -      m._handle_socket_failure()
[task 2017-02-10T23:40:04.763931Z] 23:40:04     INFO -    File "/home/worker/workspace/build/venv/local/lib/python2.7/site-packages/marionette_driver/decorators.py", line 23, in _
[task 2017-02-10T23:40:04.765374Z] 23:40:04     INFO -      return func(*args, **kwargs)
[task 2017-02-10T23:40:04.767229Z] 23:40:04     INFO -    File "/home/worker/workspace/build/venv/local/lib/python2.7/site-packages/marionette_driver/marionette.py", line 1284, in start_session
[task 2017-02-10T23:40:04.769433Z] 23:40:04     INFO -      self.protocol, _ = self.client.connect()
[task 2017-02-10T23:40:04.770252Z] 23:40:04     INFO -    File "/home/worker/workspace/build/venv/local/lib/python2.7/site-packages/marionette_driver/transport.py", line 223, in connect
[task 2017-02-10T23:40:04.770992Z] 23:40:04     INFO -      self.sock.connect((self.addr, self.port))
[task 2017-02-10T23:40:04.771704Z] 23:40:04     INFO -    File "/usr/lib/python2.7/socket.py", line 228, in meth
[task 2017-02-10T23:40:04.776997Z] 23:40:04     INFO -      return getattr(self._sock,name)(*args)
[task 2017-02-10T23:40:04.782340Z] 23:40:04     INFO -  error: [Errno 111] Connection refused
[task 2017-02-10T23:40:04.783482Z] 23:40:04    ERROR - Automation Error: Received unexpected exception while running application

I don't think that helped.

I also added gecko_log to the Marionette arguments, but that didn't seem to work.

I also defined MOZ_LOG with some common nspr logging components ("all" doesn't work currently)...that generated some additional info in the mozLogs.zip artifact, but didn't produce anything that jumps out at me as an explanation for what's going wrong here.
I suppose Firefox may be crashing on startup; I think that would explain the connection refused? I wonder: if runtests.py caught the failure here and continued, would it find the app process still running, or would it find a crash dump? I'll look into that next.
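A rough sketch of the check being proposed, assuming psutil and mozcrash are available to the harness (as they appear to be elsewhere in this bug); the helper name and arguments are hypothetical:

    import psutil
    import mozcrash

    def diagnose_connect_failure(app_pid, minidump_dir, symbols_path):
        # Is the Firefox process still alive, or did it die before Marionette came up?
        still_running = psutil.pid_exists(app_pid)
        # Report any minidumps left behind; check_for_crashes() returns a count
        # in recent mozcrash versions (a boolean in older ones).
        found_crashes = mozcrash.check_for_crashes(minidump_dir, symbols_path)
        return still_running, found_crashes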
Maybe other test harnesses besides Mn do not correctly handle exceptions thrown when trying to use Marionette for installing an add-on? Keep in mind that in such cases Marionette doesn't do any crash checking and the like; that is entirely up to the harness using Marionette.
Recent findings...

https://treeherder.mozilla.org/#/jobs?repo=try&revision=eea99c2033987e186e9d43fd49ff6311b44ad287
Fail: https://public-artifacts.taskcluster.net/SVI8qSOaTTClKpArCIrb5A/0/public/logs/live_backing.log
OK:   https://public-artifacts.taskcluster.net/KDi2NWBzTUWoxvVcCVVYoA/0/public/logs/live_backing.log

I don't think Firefox is crashing:

[task 2017-02-13T02:47:25.051571Z] 02:47:25    ERROR - Force-terminating active process(es).
[task 2017-02-13T02:47:25.052837Z] 02:47:25     INFO - Determining child pids from psutil
[task 2017-02-13T02:47:25.059342Z] 02:47:25     INFO - Found child pids: [1959, 1981]
[task 2017-02-13T02:47:25.060637Z] 02:47:25     INFO - Killing process: 1959
[task 2017-02-13T02:47:25.061522Z] 02:47:25     INFO - TEST-INFO | started process screentopng
[task 2017-02-13T02:47:25.581131Z] 02:47:25     INFO - TEST-INFO | screentopng: exit 0
[task 2017-02-13T02:47:25.585511Z] 02:47:25     INFO - Can't trigger Breakpad, just killing process
[task 2017-02-13T02:47:25.586586Z] 02:47:25     INFO - Killing process: 1981
[task 2017-02-13T02:47:25.588030Z] 02:47:25     INFO - Not taking screenshot here: see the one that was previously logged
[task 2017-02-13T02:47:25.589271Z] 02:47:25     INFO - Can't trigger Breakpad, just killing process
[task 2017-02-13T02:47:26.699131Z] 02:47:26     INFO - psutil found pid 1959 dead
[task 2017-02-13T02:47:26.700186Z] 02:47:26     INFO - psutil found pid 1981 dead
[task 2017-02-13T02:47:26.700899Z] 02:47:26     INFO - Killing process: 1955


"Marionette enabled via command-line flag" is missing from the failed cases:

https://dxr.mozilla.org/mozilla-central/rev/0eef1d5a39366059677c6d7944cfe8a97265a011/testing/marionette/components/marionette.js#111

but it should be there:

[task 2017-02-13T02:41:54.105081Z] 02:41:54     INFO - Application command: /home/worker/workspace/build/application/firefox/firefox --console -marionette -foreground -profile /tmp/tmpntlzpC.mozrunner
Startup is hanging at https://dxr.mozilla.org/mozilla-central/rev/0eef1d5a39366059677c6d7944cfe8a97265a011/toolkit/xre/nsXREDirProvider.cpp#1155:

    em->Observe(nullptr, "addons-startup", nullptr);

DoStartup() is called from https://dxr.mozilla.org/mozilla-central/rev/0eef1d5a39366059677c6d7944cfe8a97265a011/toolkit/xre/nsAppRunner.cpp#4289; since that doesn't progress, the command line handlers are never called and marionette is never started.

I'll check into why the addon manager startup appears to be hanging.
...and addon manager startup hangs due to a hang somewhere in CryptoTask.

Still making slow progress here, tracking down the source of the hang with logging and try pushes.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=bfb2e557d65cfb110b70f8b82eb259896a6ccdc1
:gbrown, any updates in this bug? I recall you mentioning some progress; it would be good to document it here.
Flags: needinfo?(gbrown)
Whiteboard: [stockwell needswork]
tl;dr - During automated test runs, browser startup sometimes hangs at https://dxr.mozilla.org/mozilla-central/rev/e1135c6fdc9bcd80d38f7285b269e030716dcb72/toolkit/mozapps/extensions/internal/XPIProvider.jsm#1596; the CryptoTask that it is waiting for was called off main thread, called NS_DispatchToMainThread(), but is not called again.


The details:
 - Marionette's connection is refused because the marionette server (in firefox) is not running.
 - Marionette is not running because it was never started; it was not started because firefox startup has hung before processing command line arguments like '-marionette'.
 - Startup is hung at https://dxr.mozilla.org/mozilla-central/rev/0eef1d5a39366059677c6d7944cfe8a97265a011/toolkit/xre/nsXREDirProvider.cpp#1155:

    em->Observe(nullptr, "addons-startup", nullptr);

 - Addons startup is hung at:

 syncLoadManifestFromFile@resource://gre/modules/addons/XPIProvider.jsm:1598:16
 processPendingFileChanges@resource://gre/modules/addons/XPIProvider.jsm:3408:19
 checkForChanges@resource://gre/modules/addons/XPIProvider.jsm:3698:19
 startup@resource://gre/modules/addons/XPIProvider.jsm:2802:25
 callProvider@resource://gre/modules/AddonManager.jsm:272:12
 _startProvider@resource://gre/modules/AddonManager.jsm:756:5
 startup@resource://gre/modules/AddonManager.jsm:939:9
 startup@resource://gre/modules/AddonManager.jsm:3136:5
 observe@resource://gre/components/addonManager.js:65:9

Sorry, those line numbers were modified by my logging in https://treeherder.mozilla.org/#/jobs?repo=try&revision=b1ecf27fcebc97a0c7d997a3e2f06631d5c24786...so, addons startup is hung at:

https://dxr.mozilla.org/mozilla-central/rev/e1135c6fdc9bcd80d38f7285b269e030716dcb72/toolkit/mozapps/extensions/internal/XPIProvider.jsm#1596

  while (success === undefined)
    thread.processNextEvent(true);

I can reproduce this failure on try, on linux64-asan. If I trigger a crash from the harness when the marionette connection times out, I consistently get a stack in nsAppShell::ProcessNextNativeEvent:

https://public-artifacts.taskcluster.net/VRVjeVOzSxuOGwAywnDkaA/0/public/logs/live_backing.log

#26 0x7f4954a044ab in g_main_context_iteration (/lib/x86_64-linux-gnu/libglib-2.0.so.0+0x4a4ab)
#27 0x7f4946de010e in nsAppShell::ProcessNextNativeEvent(bool) /home/worker/workspace/build/src/widget/gtk/nsAppShell.cpp:270:12
#28 0x7f4946d3d114 in DoProcessNextNativeEvent /home/worker/workspace/build/src/widget/nsBaseAppShell.cpp:138:17
#29 0x7f4946d3d114 in nsBaseAppShell::OnProcessNextEvent(nsIThreadInternal*, bool) /home/worker/workspace/build/src/widget/nsBaseAppShell.cpp:271
#30 0x7f4946d3dbaf in non-virtual thunk to nsBaseAppShell::OnProcessNextEvent(nsIThreadInternal*, bool) /home/worker/workspace/build/src/widget/nsBaseAppShell.cpp:233:17
#31 0x7f4940b2e7f4 in nsThread::ProcessNextEvent(bool, bool*) /home/worker/workspace/build/src/xpcom/threads/nsThread.cpp:1220:5

That seems consistent/unsurprising.
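(For reference, one simple way to force a stack like this from the harness is to abort the hung browser process once the Marionette connection times out. This is only a hypothetical sketch, assuming the ASan/crash-reporting runtime produces a report on SIGABRT, and not necessarily the mechanism used for the log above:

    import os
    import signal

    def dump_stack_on_connect_timeout(marionette, firefox_pid, timeout=120):
        try:
            marionette.start_session(timeout=timeout)
        except Exception:
            # Abort the hung process so its runtime can report where it is
            # stuck before the harness cleans everything up.
            os.kill(firefox_pid, signal.SIGABRT)
            raise
)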

 - Meanwhile, we have this happening:
CryptoTask::Run(): https://dxr.mozilla.org/mozilla-central/rev/e1135c6fdc9bcd80d38f7285b269e030716dcb72/security/manager/ssl/CryptoTask.cpp#55
...
openSignedAppFileAsync(): https://dxr.mozilla.org/mozilla-central/rev/e1135c6fdc9bcd80d38f7285b269e030716dcb72/security/apps/AppSignatureVerification.cpp#1044
verifyZipSignedState(): https://dxr.mozilla.org/mozilla-central/rev/e1135c6fdc9bcd80d38f7285b269e030716dcb72/toolkit/mozapps/extensions/internal/XPIProvider.jsm#1829
loadManifestFromZipReader(): https://dxr.mozilla.org/mozilla-central/rev/e1135c6fdc9bcd80d38f7285b269e030716dcb72/toolkit/mozapps/extensions/internal/XPIProvider.jsm#1527
loadManifestFromFile(): https://dxr.mozilla.org/mozilla-central/rev/e1135c6fdc9bcd80d38f7285b269e030716dcb72/toolkit/mozapps/extensions/internal/XPIProvider.jsm#1573

In all failures, CryptoTask::Run() is called off main thread and completes the call to NS_DispatchToMainThread(), then is never called again (is not called on the main thread).


My interpretation is that CryptoTask::Run() is waiting for the main thread, but the main thread is blocked on thread.processNextEvent(true)...but I'm out of my depth and unsure of the details.

If I change to thread.processNextEvent(false), I can no longer reproduce the hang, but that doesn't seem like a sensible solution.


:mossop -- It looks like you might know this code. Can you help me understand the issues, or suggest a solution?
Flags: needinfo?(gbrown) → needinfo?(dtownsend)
I don't know the crypto code or why it might be blocked on the main thread; you probably want Dan for that.
Flags: needinfo?(dtownsend) → needinfo?(dveditz)
No activity here in a week; checking in, as this is still a very high-frequency failure.
This is actually happening nearly all the time for Linux64, and I wonder if it is related to bug 1293093.

The reason is:

[task 2017-03-05T23:54:21.632367Z] 23:54:21     INFO - runtests.py | Application pid: 1642
[task 2017-03-05T23:54:21.633212Z] 23:54:21     INFO - TEST-INFO | started process Main app process
[task 2017-03-05T23:56:21.722802Z] 23:56:21     INFO -  Traceback (most recent call last):

We try to start Firefox but fail after 2 minutes.
Depends on: 1345105
Comment 41 documents the issue: A deadlock involving crypto code waiting for the main thread during startup.

Bug 1345105 may improve (or worsen) the behavior by changing startup timing, but it will not address the underlying issue.

I'm waiting on the needinfo to dveditz.
I think David Keeler can also help us here.
Flags: needinfo?(dkeeler)
(In reply to Geoff Brown [:gbrown] from comment #41)
...
> In all failures, CryptoTask::Run() is called off main thread and completes
> the call to NS_DispatchToMainThread(), then is never called again (is not
> called on the main thread).
> 
> 
> My interpretation is that CryptoTask::Run() is waiting for the main thread,
> but the main thread is blocked on thread.processNextEvent(true)...but I'm
> out of my depth and unsure of the details.

It's not waiting for the main thread - it's waiting to run on the main thread (which processNextEvent should do).

> If I change to thread.processNextEvent(false), I can no longer reproduce the
> hang, but that doesn't seem like a sensible solution.

This turns an idle wait into a busy wait.

It seems like what's happening here is that the CryptoTask is being added to the main thread's event queue but for some reason processNextEvent doesn't process the event unless called repeatedly. Maybe running with the nsThread log set to debug would be informative?
Flags: needinfo?(dkeeler)
I am not sure who owns fixing this bug. :gbrown, is this something you can do, or do you need someone like :keeler to fix it?
Flags: needinfo?(gbrown)
Sorry, couldn't look at this yesterday; it remains a high priority for me.

I will try to reproduce with nsThread logging, as suggested in comment 51.

I am happy to hand off this bug to anyone who is interested, but I think it remains unclear which component is to blame (we know where browser startup is hanging now, but don't fully understand why).
Flags: needinfo?(gbrown)
The platform breakdown in comment 54 is quite different from earlier. Curiously, this bug remains mostly Linux on trunk, but spiked suddenly on aurora March 6-9, primarily on Windows PGO.
The aurora/windows-pgo failures look quite different:

[Errno 10061] No connection could be made because the target machine actively refused it

Those started with https://treeherder.mozilla.org/#/jobs?repo=mozilla-aurora&revision=0931190cd725a744b74f8f47863b7b38be8e8ebf&filter-searchStr=windows+pgo. I understand tomcat is looking into it.


I'll remain focused on the Linux failures.
In https://treeherder.mozilla.org/#/jobs?repo=try&revision=ce67f735b6525cb5cb6a9a4f629a205a6ffc7008, I added nsThread logging, which I see in mozLogs.zip files for all the successful test runs. Unfortunately, for the failed cases, like https://treeherder.mozilla.org/logviewer.html#?job_id=83862794&repo=try, mozLogs.zip does not contain any traces for the failed pid.
Well, for successful jobs we clearly see that events are getting processed on the main thread:

[task 2017-03-15T00:51:03.032162Z] 00:51:03     INFO - runtests.py | Application pid: 1130
[task 2017-03-15T00:51:03.032244Z] 00:51:03     INFO - TEST-INFO | started process GECKO(1130)
[task 2017-03-15T00:51:04.063991Z] 00:51:04     INFO - GECKO(1130) | ZZZ XREMain::XRE_mainRun calls dirprovider.DoStartup
[task 2017-03-15T00:51:04.105691Z] 00:51:04     INFO - GECKO(1130) | nsXREDirProvider::DoStartup calls addons Observe for addons-startup on main thread
[task 2017-03-15T00:51:04.731357Z] 00:51:04     INFO - GECKO(1130) | ZZZ OpenSignedAppFileAsync
[task 2017-03-15T00:51:04.732343Z] 00:51:04     INFO - GECKO(1130) | ZZZ OpenSignedAppFileTask::CalculateResult called
[task 2017-03-15T00:51:04.733710Z] 00:51:04     INFO - GECKO(1130) | ProcessNextNativeEvent on main, mayWait=0
[task 2017-03-15T00:51:04.735141Z] 00:51:04     INFO - GECKO(1130) | ProcessNextNativeEvent on main, mayWait=0
[task 2017-03-15T00:51:04.749637Z] 00:51:04     INFO - GECKO(1130) | ZZZ OpenSignedAppFile OK
[task 2017-03-15T00:51:04.749777Z] 00:51:04     INFO - GECKO(1130) | ZZZ CryptoTask::Run 0x60b000041ee0 - NS_DispatchToMainThread okay
[task 2017-03-15T00:51:04.765823Z] 00:51:04     INFO - GECKO(1130) | ProcessNextNativeEvent on main, mayWait=0
[task 2017-03-15T00:51:04.766754Z] 00:51:04     INFO - GECKO(1130) | ProcessNextNativeEvent on main, mayWait=0

Whereby for failing jobs the following is happening:

[task 2017-03-15T01:33:50.248452Z] 01:33:50     INFO - runtests.py | Application pid: 1804
[task 2017-03-15T01:33:50.248541Z] 01:33:50     INFO - TEST-INFO | started process GECKO(1804)
[task 2017-03-15T01:33:51.090452Z] 01:33:51     INFO - GECKO(1804) | ZZZ XREMain::XRE_mainRun calls dirprovider.DoStartup
[task 2017-03-15T01:33:51.126704Z] 01:33:51     INFO - GECKO(1804) | nsXREDirProvider::DoStartup calls addons Observe for addons-startup on main thread
[task 2017-03-15T01:33:51.756964Z] 01:33:51     INFO - GECKO(1804) | ZZZ OpenSignedAppFileAsync
[task 2017-03-15T01:33:51.757053Z] 01:33:51     INFO - GECKO(1804) | ZZZ OpenSignedAppFileTask::CalculateResult called
[task 2017-03-15T01:33:51.757988Z] 01:33:51     INFO - GECKO(1804) | ProcessNextNativeEvent on main, mayWait=0
[task 2017-03-15T01:33:51.758663Z] 01:33:51     INFO - GECKO(1804) | ProcessNextNativeEvent on main, mayWait=0
[task 2017-03-15T01:33:51.782045Z] 01:33:51     INFO - GECKO(1804) | ZZZ OpenSignedAppFile OK
[task 2017-03-15T01:33:51.785720Z] 01:33:51     INFO - GECKO(1804) | ZZZ CryptoTask::Run 0x60b000041e30 - NS_DispatchToMainThread okay
[task 2017-03-15T01:35:50.279997Z] 01:35:50     INFO -  Traceback (most recent call last):

The crypto task got dispatched to the main thread, right? But then we have 2 minutes without any processing of events. At least this confirms the assumptions so far.

Would it help to enable all kinds of logging, so we can see what is blocking the main thread from processing events?
Flags: needinfo?(dkeeler)
Thanks for having a look, whimboo. 

I tried changing MOZ_LOG to "all" - I got no mozLogs.zip output at all! https://treeherder.mozilla.org/#/jobs?repo=try&revision=9de48e76eef69082f8d67c2f52ad0372ee6cb11f
You forgot to specify the level; "all:5" should do it.
Looks like some of those tests are failing with "Log file size 83275573 is greater than max allowed 52428800! Setting TBPL_FAILURE (was SUCCESS)..." - could that be preventing the log file from being uploaded? You could try enabling just some logs (although it looks like you did that in comment 60 and it didn't help...). In PSM, the relevant ones would be pipnss:4 and maybe certverifier:5.
Flags: needinfo?(dkeeler)
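As a concrete illustration, targeted logging could be enabled through the environment the harness passes to Firefox, using the modules mentioned above; the output path is hypothetical, and keeping the module list small helps stay under the log-size limit:

    import os

    env = os.environ.copy()
    # pipnss/certverifier for PSM, nsThread for the event-processing question.
    env["MOZ_LOG"] = "pipnss:4,certverifier:5,nsThread:5"
    env["MOZ_LOG_FILE"] = "/tmp/gecko.log"  # hypothetical output path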
Rather than fiddle with MOZ_LOG further, I just changed the LOG statements in nsThread.cpp into printf's:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=42299ff365d4d3827a19ac8713e0a95a8e073435

That causes even more "Log file size is greater than max allowed" failures, and also seems to make it more difficult to reproduce this hang, but I was still able to produce a hang and a corresponding successful run:


Hang:

https://public-artifacts.taskcluster.net/TCIijhPiSY-yvA3Dv8M8gg/0/public/logs/live_backing.log

[task 2017-03-15T19:54:10.847829Z] 19:54:10     INFO - GECKO(2276) | ZZZ OpenSignedAppFileTask::CalculateResult called
[task 2017-03-15T19:54:10.848564Z] 19:54:10     INFO - GECKO(2276) | THRD(0x612000009dc0) Dispatch [(nil) 0]
[task 2017-03-15T19:54:10.849391Z] 19:54:10     INFO - GECKO(2276) | ProcessNextNativeEvent on main, mayWait=0
[task 2017-03-15T19:54:10.850233Z] 19:54:10     INFO - GECKO(2276) | ProcessNextNativeEvent on main, mayWait=0
[task 2017-03-15T19:54:10.871748Z] 19:54:10     INFO - GECKO(2276) | ZZZ OpenSignedAppFile OK
[task 2017-03-15T19:54:10.874809Z] 19:54:10     INFO - GECKO(2276) | THRD(0x612000009dc0) Dispatch [(nil) 0]
[task 2017-03-15T19:54:10.875728Z] 19:54:10     INFO - GECKO(2276) | ZZZ CryptoTask::Run 0x60b0000420f0 - NS_DispatchToMainThread okay
[task 2017-03-15T19:54:10.876495Z] 19:54:10     INFO - GECKO(2276) | THRD(0x6120000e6bc0) ProcessNextEvent [0 0]
[task 2017-03-15T19:54:10.877298Z] 19:54:10     INFO - GECKO(2276) | THRD(0x6120000e6bc0) ProcessNextEvent [1 0]
[task 2017-03-15T19:55:00.167762Z] 19:55:00     INFO - GECKO(2276) | THRD(0x6120000d7d40) Dispatch [(nil) 0]
[task 2017-03-15T19:55:00.168578Z] 19:55:00     INFO - GECKO(2276) | THRD(0x6120000d7d40) ProcessNextEvent [0 0]
[task 2017-03-15T19:55:00.169364Z] 19:55:00     INFO - GECKO(2276) | THRD(0x6120000d7d40) running [0x621000643d20]
[task 2017-03-15T19:55:00.170168Z] 19:55:00     INFO - GECKO(2276) | THRD(0x6120000d7d40) ProcessNextEvent [0 0]
[task 2017-03-15T19:55:00.170924Z] 19:55:00     INFO - GECKO(2276) | THRD(0x6120000d7d40) Dispatch [(nil) 0]
[task 2017-03-15T19:55:00.171643Z] 19:55:00     INFO - GECKO(2276) | THRD(0x6120000d7d40) ProcessNextEvent [0 0]
[task 2017-03-15T19:55:00.172459Z] 19:55:00     INFO - GECKO(2276) | THRD(0x6120000d7d40) running [0x607000065580]
[task 2017-03-15T19:55:00.173193Z] 19:55:00     INFO - GECKO(2276) | THRD(0x6120000d7d40) ProcessNextEvent [0 0]
[task 2017-03-15T19:55:00.173760Z] 19:55:00     INFO - GECKO(2276) | THRD(0x612000009dc0) Dispatch [(nil) 0]
[task 2017-03-15T19:55:10.173609Z] 19:55:10     INFO - GECKO(2276) | THRD(0x6120000d7d40) Dispatch [(nil) 0]
[task 2017-03-15T19:55:10.174726Z] 19:55:10     INFO - GECKO(2276) | THRD(0x6120000d7d40) ProcessNextEvent [0 0]
[task 2017-03-15T19:55:10.174793Z] 19:55:10     INFO - GECKO(2276) | THRD(0x6120000d7d40) running [0x621000643d20]
[task 2017-03-15T19:55:10.174852Z] 19:55:10     INFO - GECKO(2276) | THRD(0x6120000d7d40) ProcessNextEvent [0 0]
[task 2017-03-15T19:56:09.314943Z] 19:56:09     INFO -  Traceback (most recent call last):
[task 2017-03-15T19:56:09.315267Z] 19:56:09     INFO -    File "/home/worker/workspace/build/tests/mochitest/runtests.py", line 2491, in doTests
[task 2017-03-15T19:56:09.315942Z] 19:56:09     INFO -      marionette_args=marionette_args,
[task 2017-03-15T19:56:09.317786Z] 19:56:09     INFO -    File "/home/worker/workspace/build/tests/mochitest/runtests.py", line 2095, in runApp
[task 2017-03-15T19:56:09.317854Z] 19:56:09     INFO -      self.marionette.start_session(timeout=port_timeout)
[task 2017-03-15T19:56:09.318038Z] 19:56:09     INFO -    File "/home/worker/workspace/build/venv/local/lib/python2.7/site-packages/marionette_driver/decorators.py", line 28, in _
[task 2017-03-15T19:56:09.318753Z] 19:56:09     INFO -      m._handle_socket_failure()
[task 2017-03-15T19:56:09.319745Z] 19:56:09     INFO -    File "/home/worker/workspace/build/venv/local/lib/python2.7/site-packages/marionette_driver/decorators.py", line 23, in _
[task 2017-03-15T19:56:09.320979Z] 19:56:09     INFO -      return func(*args, **kwargs)
[task 2017-03-15T19:56:09.322240Z] 19:56:09     INFO -    File "/home/worker/workspace/build/venv/local/lib/python2.7/site-packages/marionette_driver/marionette.py", line 1292, in start_session
[task 2017-03-15T19:56:09.323561Z] 19:56:09     INFO -      self.protocol, _ = self.client.connect()
[task 2017-03-15T19:56:09.325582Z] 19:56:09     INFO -    File "/home/worker/workspace/build/venv/local/lib/python2.7/site-packages/marionette_driver/transport.py", line 223, in connect
[task 2017-03-15T19:56:09.326296Z] 19:56:09     INFO -      self.sock.connect((self.addr, self.port))
[task 2017-03-15T19:56:09.326902Z] 19:56:09     INFO -    File "/usr/lib/python2.7/socket.py", line 228, in meth
[task 2017-03-15T19:56:09.327380Z] 19:56:09     INFO -      return getattr(self._sock,name)(*args)
[task 2017-03-15T19:56:09.327855Z] 19:56:09     INFO -  error: [Errno 111] Connection refused


Success:

[task 2017-03-15T19:59:39.346412Z] 19:59:39     INFO - GECKO(2207) | ZZZ OpenSignedAppFileTask::CalculateResult called
[task 2017-03-15T19:59:39.352121Z] 19:59:39     INFO - GECKO(2207) | ZZZ OpenSignedAppFile OK
[task 2017-03-15T19:59:39.353319Z] 19:59:39     INFO - GECKO(2207) | THRD(0x612000009dc0) Dispatch [(nil) 0]
[task 2017-03-15T19:59:39.354864Z] 19:59:39     INFO - GECKO(2207) | ZZZ CryptoTask::Run 0x60b000042040 - NS_DispatchToMainThread okay
[task 2017-03-15T19:59:39.355903Z] 19:59:39     INFO - GECKO(2207) | THRD(0x6120000e6bc0) ProcessNextEvent [0 0]
[task 2017-03-15T19:59:39.356826Z] 19:59:39     INFO - GECKO(2207) | THRD(0x6120000e6bc0) ProcessNextEvent [1 0]
[task 2017-03-15T19:59:39.379060Z] 19:59:39     INFO - GECKO(2207) | THRD(0x612000009dc0) running [0x607000000f00]
[task 2017-03-15T19:59:39.381582Z] 19:59:39     INFO - GECKO(2207) | THRD(0x612000009dc0) ProcessNextEvent [1 0]
[task 2017-03-15T19:59:39.382870Z] 19:59:39     INFO - GECKO(2207) | ProcessNextNativeEvent on main, mayWait=0
[task 2017-03-15T19:59:39.383540Z] 19:59:39     INFO - GECKO(2207) | ProcessNextNativeEvent on main, mayWait=0
[task 2017-03-15T19:59:39.384713Z] 19:59:39     INFO - GECKO(2207) | ProcessNextNativeEvent on main, mayWait=0
[task 2017-03-15T19:59:39.385528Z] 19:59:39     INFO - GECKO(2207) | THRD(0x612000009dc0) running [0x604000017390]
[task 2017-03-15T19:59:39.386594Z] 19:59:39     INFO - GECKO(2207) | THRD(0x612000009dc0) ProcessNextEvent [1 0]
[task 2017-03-15T19:59:39.387948Z] 19:59:39     INFO - GECKO(2207) | ProcessNextNativeEvent on main, mayWait=0
[task 2017-03-15T19:59:39.389559Z] 19:59:39     INFO - GECKO(2207) | THRD(0x612000009dc0) running [0x604000040f10]
[task 2017-03-15T19:59:39.390943Z] 19:59:39     INFO - GECKO(2207) | THRD(0x612000009dc0) ProcessNextEvent [1 0]
[task 2017-03-15T19:59:39.391911Z] 19:59:39     INFO - GECKO(2207) | ProcessNextNativeEvent on main, mayWait=0
[task 2017-03-15T19:59:39.393182Z] 19:59:39     INFO - GECKO(2207) | THRD(0x612000009dc0) running [0x6120000c5d40]
[task 2017-03-15T19:59:39.395848Z] 19:59:39     INFO - GECKO(2207) | THRD(0x612000009dc0) Dispatch [(nil) 0]
[task 2017-03-15T19:59:39.396906Z] 19:59:39     INFO - GECKO(2207) | THRD(0x612000009dc0) ProcessNextEvent [1 0]
[task 2017-03-15T19:59:39.398041Z] 19:59:39     INFO - GECKO(2207) | ProcessNextNativeEvent on main, mayWait=0
[task 2017-03-15T19:59:39.398829Z] 19:59:39     INFO - GECKO(2207) | ProcessNextNativeEvent on main, mayWait=0
[task 2017-03-15T19:59:39.399920Z] 19:59:39     INFO - GECKO(2207) | THRD(0x612000009dc0) running [0x6040000bde90]
[task 2017-03-15T19:59:39.400867Z] 19:59:39     INFO - GECKO(2207) | THRD(0x612000009dc0) ProcessNextEvent [1 0]
[task 2017-03-15T19:59:39.402731Z] 19:59:39     INFO - GECKO(2207) | ProcessNextNativeEvent on main, mayWait=0
[task 2017-03-15T19:59:39.403491Z] 19:59:39     INFO - GECKO(2207) | THRD(0x612000009dc0) running [0x607000065740]
[task 2017-03-15T19:59:39.404278Z] 19:59:39     INFO - GECKO(2207) | THRD(0x612000009dc0) ProcessNextEvent [1 0]
[task 2017-03-15T19:59:39.405062Z] 19:59:39     INFO - GECKO(2207) | ProcessNextNativeEvent on main, mayWait=0
[task 2017-03-15T19:59:39.405678Z] 19:59:39     INFO - GECKO(2207) | THRD(0x612000009dc0) running [0x60700006d150]
[task 2017-03-15T19:59:39.406488Z] 19:59:39     INFO - GECKO(2207) | THRD(0x612000009dc0) ProcessNextEvent [1 0]
[task 2017-03-15T19:59:39.407274Z] 19:59:39     INFO - GECKO(2207) | ProcessNextNativeEvent on main, mayWait=0
[task 2017-03-15T19:59:39.408085Z] 19:59:39     INFO - GECKO(2207) | THRD(0x612000009dc0) running [0x6040000f3990]
[task 2017-03-15T19:59:39.408950Z] 19:59:39     INFO - GECKO(2207) | THRD(0x612000009dc0) ProcessNextEvent [1 0]
[task 2017-03-15T19:59:39.416002Z] 19:59:39     INFO - GECKO(2207) | ProcessNextNativeEvent on main, mayWait=0
[task 2017-03-15T19:59:39.418090Z] 19:59:39     INFO - GECKO(2207) | THRD(0x612000009dc0) running [0x60b000042040]
[task 2017-03-15T19:59:39.418860Z] 19:59:39     INFO - GECKO(2207) | ZZZ CryptoTask::Run on main thread
[task 2017-03-15T19:59:39.421100Z] 19:59:39     INFO - GECKO(2207) | ZZZ OpenSignedAppFileTask::CallCallback called
[task 2017-03-15T19:59:39.422265Z] 19:59:39     INFO - GECKO(2207) | THRD(0x6120000e6bc0) sync shutdown
[task 2017-03-15T19:59:39.422841Z] 19:59:39     INFO - GECKO(2207) | THRD(0x6120000e6bc0) running [0x6060002753c0]
[task 2017-03-15T19:59:39.423602Z] 19:59:39     INFO - GECKO(2207) | THRD(0x6120000e6bc0) ProcessNextEvent [0 0]
[task 2017-03-15T19:59:39.425564Z] 19:59:39     INFO - GECKO(2207) | THRD(0x612000009dc0) Dispatch [(nil) 0]
[task 2017-03-15T19:59:39.426182Z] 19:59:39     INFO - GECKO(2207) | THRD(0x612000009dc0) ProcessNextEvent [1 1]
[task 2017-03-15T19:59:39.428992Z] 19:59:39     INFO - GECKO(2207) | THRD(0x612000009dc0) Dispatch [(nil) 0]
https://treeherder.mozilla.org/#/jobs?repo=try&revision=f048161f5db572f518b449fad9d2399b5bcf8eb0 has thread names plus some additional logging. 

In the failed cases, "Cache2 I/O" is the last thread to call ProcessNextEvent before the hang, but that ProcessNextEvent call seems to complete - not clear that there's any problem there. In the failed cases, the main thread is last seen calling ProcessNextEvent just before OpenSignedAppFileAsync is called; I still don't see what happens to that thread.
Well, the interesting piece here is the timing. Have a look at the following excerpt:

[task 2017-03-16T04:05:52.028141Z] 04:05:52     INFO - GECKO(1128) | [SignedJAR] THRD(0x6120000e6bc0) ProcessNextEvent [0 0]
[task 2017-03-16T04:05:52.028184Z] 04:05:52     INFO - GECKO(1128) | ==> return
[task 2017-03-16T04:05:52.028236Z] 04:05:52     INFO - GECKO(1128) | [SignedJAR] THRD(0x6120000e6bc0) ProcessNextEvent [1 0]
[task 2017-03-16T04:06:41.348327Z] 04:06:41     INFO - GECKO(1128) | THRD(0x6120000d7d40) Dispatch [(nil) 0]
[task 2017-03-16T04:06:41.348608Z] 04:06:41     INFO - GECKO(1128) | [Cache2 I/O] THRD(0x6120000d7d40) ProcessNextEvent [0 0]
[task 2017-03-16T04:06:41.349038Z] 04:06:41     INFO - GECKO(1128) | THRD(0x6120000d7d40) running [0x621000643d20]
[task 2017-03-16T04:06:41.350368Z] 04:06:41     INFO - GECKO(1128) | ==> return

There is a hang of about 50s between the return of the SignedJAR thread and the dispatching and running of the Cache2 I/O thread.

Would strace maybe give us further information?
That is an interesting pause, present in the failures and not present in the successful runs.

strace might produce something interesting...do you know how to set that up, whimboo?
(In reply to Geoff Brown [:gbrown] from comment #70)
> strace might produce something interesting...do you know how to set that up,
> whimboo?

Sadly not. But Dustin gave a quick hint in bug 1331198 comment 16, so something like "strace -p <firefox pid>" should be enough. I haven't had to use it to fix a problem myself, though.
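A minimal sketch of attaching strace to the already-running Firefox process from the harness side, assuming strace is installed on the worker (the helper and output path are hypothetical):

    import subprocess

    def attach_strace(firefox_pid, output_path="/tmp/firefox-startup.strace"):
        # -f follows threads/forks, -tt timestamps each call, -o writes to a file.
        return subprocess.Popen(
            ["strace", "-f", "-tt", "-p", str(firefox_pid), "-o", output_path])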
My latest try push, with lots of logging: https://treeherder.mozilla.org/#/jobs?repo=try&revision=331bfa1250d8ef90a5a6a684f404d5b4d9d40b1c

Note that "-b o -p linux64-asan -u mochitest-gpu -t none --rebuild 20" seems to reliably reproduce the hang.
(In reply to Henrik Skupin (:whimboo) from comment #69)
> Well, the interesting piece here is the timing. Have a look at the following
> excerpt:
> 
> [task 2017-03-16T04:05:52.028141Z] 04:05:52     INFO - GECKO(1128) |
> [SignedJAR] THRD(0x6120000e6bc0) ProcessNextEvent [0 0]
> [task 2017-03-16T04:05:52.028184Z] 04:05:52     INFO - GECKO(1128) | ==>
> return
> [task 2017-03-16T04:05:52.028236Z] 04:05:52     INFO - GECKO(1128) |
> [SignedJAR] THRD(0x6120000e6bc0) ProcessNextEvent [1 0]
> [task 2017-03-16T04:06:41.348327Z] 04:06:41     INFO - GECKO(1128) |
> THRD(0x6120000d7d40) Dispatch [(nil) 0]
> [task 2017-03-16T04:06:41.348608Z] 04:06:41     INFO - GECKO(1128) | [Cache2
> I/O] THRD(0x6120000d7d40) ProcessNextEvent [0 0]
> [task 2017-03-16T04:06:41.349038Z] 04:06:41     INFO - GECKO(1128) |
> THRD(0x6120000d7d40) running [0x621000643d20]
> [task 2017-03-16T04:06:41.350368Z] 04:06:41     INFO - GECKO(1128) | ==>
> return
> 
> There is a hang of about 50s between the return of the SignedJAR thread and
> the dispatching and running of the Cache2 I/O thread.

At first I thought that was interesting too, but actually, I think there's no reason to expect the Cache2 thread to be active immediately after the SignedJAR thread event completes. In successful runs there are also long pauses between bursts of Cache2 activity. (That may just not be apparent because the main thread, and other threads, are continuously active.)
(In reply to Geoff Brown [:gbrown] from comment #76)
> My latest try push, with lots of logging:
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=331bfa1250d8ef90a5a6a684f404d5b4d9d40b1c

Looking at one of the failures:

https://public-artifacts.taskcluster.net/G4-HSvXAQV6KEbrO6N_yGw/0/public/logs/live_backing.log

[task 2017-03-22T18:05:22.879544Z] 18:05:22     INFO - GECKO(1468) | nsXREDirProvider::DoStartup calls addons Observe for addons-startup on main thread
[task 2017-03-22T18:05:23.138075Z] 18:05:23     INFO - GECKO(1468) | THRD(0x61400000ce40) Dispatch [(nil) 0]
[task 2017-03-22T18:05:23.478692Z] 18:05:23     INFO - GECKO(1468) | THRD(0x61400000ce40) Dispatch [(nil) 0]
[task 2017-03-22T18:05:23.480116Z] 18:05:23     INFO - GECKO(1468) | 1490205923472	addons.xpi	WARN	syncLoadManifestFromFile waiting for completion
[task 2017-03-22T18:05:23.480809Z] 18:05:23     INFO - GECKO(1468) | [Main Thread] THRD(0x61400000ce40) ProcessNextEvent [1 0]
[task 2017-03-22T18:05:23.482189Z] 18:05:23     INFO - GECKO(1468) | ZZZ OpenSignedAppFileAsync on main thread
[task 2017-03-22T18:05:23.483832Z] 18:05:23     INFO - GECKO(1468) | THRD(0x6140000a8640) Dispatch [(nil) 0]
[task 2017-03-22T18:05:23.485374Z] 18:05:23     INFO - GECKO(1468) | (0x60b000041a10) CryptoTask::Dispatch OK
[task 2017-03-22T18:05:23.485962Z] 18:05:23     INFO - GECKO(1468) | ZZZ OpenSignedAppFileAsync - SignedJAR Dispatch OK
[task 2017-03-22T18:05:23.486627Z] 18:05:23     INFO - GECKO(1468) | THRD(0x61400000ce40) Dispatch [(nil) 0]
[task 2017-03-22T18:05:23.486726Z] 18:05:23     INFO - GECKO(1468) | [Main Thread] THRD(0x61400000ce40) ProcessNextEvent calls OnProcessNextEvent [1]
[task 2017-03-22T18:05:23.487338Z] 18:05:23     INFO - GECKO(1468) | [Main Thread] ProcessNextNativeEvent, mayWait=0
[task 2017-03-22T18:05:23.490190Z] 18:05:23     INFO - GECKO(1468) | [SignedJAR] THRD(0x6140000a8640) ProcessNextEvent [0 0]
[task 2017-03-22T18:05:23.490953Z] 18:05:23     INFO - GECKO(1468) | THRD(0x6140000a8640) running [0x60b000041a10]
[task 2017-03-22T18:05:23.491779Z] 18:05:23     INFO - GECKO(1468) | [Main Thread] ProcessNextNativeEvent returns 1
[task 2017-03-22T18:05:23.492888Z] 18:05:23     INFO - GECKO(1468) | [Main Thread] ProcessNextNativeEvent, mayWait=0
[task 2017-03-22T18:05:23.518732Z] 18:05:23     INFO - GECKO(1468) | OpenSignedAppFile returns NS_OK
[task 2017-03-22T18:05:23.521801Z] 18:05:23     INFO - GECKO(1468) | (0x60b000041a10) OpenSignedAppFileTask::CalculateResult returns
[task 2017-03-22T18:05:23.522734Z] 18:05:23     INFO - GECKO(1468) | THRD(0x61400000ce40) Dispatch [(nil) 0]
[task 2017-03-22T18:05:23.523675Z] 18:05:23     INFO - GECKO(1468) | (0x60b000041a10) CryptoTask::Run - NS_DispatchToMainThread okay
[task 2017-03-22T18:05:23.524548Z] 18:05:23     INFO - GECKO(1468) | ==> THRD(0x6140000a8640) ProcessNextEvent return
[task 2017-03-22T18:05:23.525419Z] 18:05:23     INFO - GECKO(1468) | [SignedJAR] THRD(0x6140000a8640) ProcessNextEvent [0 0]
[task 2017-03-22T18:05:23.526485Z] 18:05:23     INFO - GECKO(1468) | ==> THRD(0x6140000a8640) ProcessNextEvent return
[task 2017-03-22T18:05:23.527467Z] 18:05:23     INFO - GECKO(1468) | [SignedJAR] THRD(0x6140000a8640) ProcessNextEvent [1 0]
[task 2017-03-22T18:06:12.826174Z] 18:06:12     INFO - GECKO(1468) | THRD(0x61400008fe40) Dispatch [(nil) 0]
[task 2017-03-22T18:06:12.827068Z] 18:06:12     INFO - GECKO(1468) | [Cache2 I/O] THRD(0x61400008fe40) ProcessNextEvent [0 0]
[task 2017-03-22T18:06:12.828077Z] 18:06:12     INFO - GECKO(1468) | THRD(0x61400008fe40) running [0x621000643d20]
[task 2017-03-22T18:06:12.828912Z] 18:06:12     INFO - GECKO(1468) | ==> THRD(0x61400008fe40) ProcessNextEvent return
[task 2017-03-22T18:06:12.829729Z] 18:06:12     INFO - GECKO(1468) | [Cache2 I/O] THRD(0x61400008fe40) ProcessNextEvent [0 0]
[task 2017-03-22T18:06:12.830412Z] 18:06:12     INFO - GECKO(1468) | ==> THRD(0x61400008fe40) ProcessNextEvent return
[task 2017-03-22T18:06:12.831404Z] 18:06:12     INFO - GECKO(1468) | THRD(0x61400008fe40) Dispatch [(nil) 0]
[task 2017-03-22T18:06:12.832129Z] 18:06:12     INFO - GECKO(1468) | [Cache2 I/O] THRD(0x61400008fe40) ProcessNextEvent [0 0]
[task 2017-03-22T18:06:12.832858Z] 18:06:12     INFO - GECKO(1468) | THRD(0x61400008fe40) running [0x607000065580]
[task 2017-03-22T18:06:12.833615Z] 18:06:12     INFO - GECKO(1468) | ==> THRD(0x61400008fe40) ProcessNextEvent return
[task 2017-03-22T18:06:12.834452Z] 18:06:12     INFO - GECKO(1468) | [Cache2 I/O] THRD(0x61400008fe40) ProcessNextEvent [0 0]
[task 2017-03-22T18:06:12.835233Z] 18:06:12     INFO - GECKO(1468) | ==> THRD(0x61400008fe40) ProcessNextEvent return
[task 2017-03-22T18:06:12.835908Z] 18:06:12     INFO - GECKO(1468) | THRD(0x61400000ce40) Dispatch [(nil) 0]
[task 2017-03-22T18:06:22.845240Z] 18:06:22     INFO - GECKO(1468) | THRD(0x61400008fe40) Dispatch [(nil) 0]
[task 2017-03-22T18:06:22.845358Z] 18:06:22     INFO - GECKO(1468) | [Cache2 I/O] THRD(0x61400008fe40) ProcessNextEvent [0 0]
[task 2017-03-22T18:06:22.846286Z] 18:06:22     INFO - GECKO(1468) | THRD(0x61400008fe40) running [0x621000643d20]
[task 2017-03-22T18:06:22.846355Z] 18:06:22     INFO - GECKO(1468) | ==> THRD(0x61400008fe40) ProcessNextEvent return
[task 2017-03-22T18:06:22.846815Z] 18:06:22     INFO - GECKO(1468) | [Cache2 I/O] THRD(0x61400008fe40) ProcessNextEvent [0 0]
[task 2017-03-22T18:06:22.847242Z] 18:06:22     INFO - GECKO(1468) | ==> THRD(0x61400008fe40) ProcessNextEvent return
[task 2017-03-22T18:07:22.068483Z] 18:07:22     INFO -  Traceback (most recent call last):

Note particularly this part:

[task 2017-03-22T18:05:23.491779Z] 18:05:23     INFO - GECKO(1468) | [Main Thread] ProcessNextNativeEvent returns 1
[task 2017-03-22T18:05:23.492888Z] 18:05:23     INFO - GECKO(1468) | [Main Thread] ProcessNextNativeEvent, mayWait=0
[task 2017-03-22T18:05:23.518732Z] 18:05:23     INFO - GECKO(1468) | OpenSignedAppFile returns NS_OK
[task 2017-03-22T18:05:23.521801Z] 18:05:23     INFO - GECKO(1468) | (0x60b000041a10) OpenSignedAppFileTask::CalculateResult returns
[task 2017-03-22T18:05:23.522734Z] 18:05:23     INFO - GECKO(1468) | THRD(0x61400000ce40) Dispatch [(nil) 0]
[task 2017-03-22T18:05:23.523675Z] 18:05:23     INFO - GECKO(1468) | (0x60b000041a10) CryptoTask::Run - NS_DispatchToMainThread okay

https://hg.mozilla.org/try/rev/51e6963f68a269fc69c76f9844f588a94fd58af7#l7.18

nsAppShell::ProcessNextNativeEvent(), running on the main thread, called g_main_context_iteration() but that never returned. OpenSignedAppFile() and OpenSignedAppFileTask::CalculateResult() completed immediately after, but that's not on the main thread. What happened to the main thread?


:keeler -- Can you have another look? I don't know what else to do here.
Assignee: gbrown → nobody
Flags: needinfo?(dkeeler)
Do we have an idea of which tests are causing this? Maybe we can narrow it down that way.
I don't think there is any connection to individual tests -- this is a hang on browser startup, well before any tests are run. It might be affected by add-ons or anything else configured in the test profile.

It seems to affect all mochitests, but hits mochitest-gl and mochitest-gpu more often than plain - I don't know why.

I wouldn't be surprised if there is a similar issue affecting reftests that is reported differently, but I haven't noticed anything specific.
To test out the assumption that something in the threading of OpenSignedAppFile was causing the timeouts, I wrote a patch that (I believe) made it synchronous rather than async. With 30 retriggers, the intermittent didn't show up:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=e24de258c0bca34c852af74bdf61ee1b668e85ea

This then reminded me of bug 1185489, which involves code in this area that isn't thread-safe. I wrote a different patch that just removed the code in question. Again, after 30 retriggers, the intermittent didn't show up:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=fb748ae85e71956819f37502008d508813682cd2

To convince myself that these results were significant, I attempted to reproduce the intermittent on the base revision that both of those patches are based on:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=32bad569ddff5bc4c1be3dd7ed0122090af401ec&filter-searchStr=gpu

.. and unfortunately I haven't been able to make that fail, despite that being the same code as this:

https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&revision=4c987b7ed54a&filter-searchStr=linux%20x64%20asan%20gpu

which does have a failure.

I'll keep looking, but unfortunately I really don't have anything to add at the moment.
Flags: needinfo?(dkeeler)
Thanks :keeler. Since my logging seems to make this happen with greater regularity, here are try pushes with my logging on its own + with each of your patches: https://treeherder.mozilla.org/#/jobs?repo=try&author=gbrown@mozilla.com&tochange=86612c999285064cdc09abcb3c7398e1069089aa&fromchange=032ca845c994d6d9118c1db376daa0072d72fac6
...your "convert openSignedAppFileAsync to sync" patch seems to fix the hang!
:keeler - Not trying to rush you, but thought I'd check in to see if you noticed the previous couple of comments. Is converting to sync a viable solution here? Let me know if I can run more try experiments or help with anything else.
Flags: needinfo?(dkeeler)
I did see those comments, but thanks for the needinfo anyway. Converting that function to a synchronous call isn't really viable - the file reading/public key operations are long-running and not something we want blocking the main thread. What's odd is that this use of CryptoTask fails so frequently while others don't seem to. I'll keep looking, though.
Flags: needinfo?(dkeeler)
Flags: needinfo?(dveditz)
:keeler, this is still causing a lot of pain on our trees; can you look into a more appropriate fix for openSignedAppFileAsync (per comment 13)?
Flags: needinfo?(dkeeler)
I didn't manage to get to this today, but it's on my radar. Keeping the ni? so I remember to have a look.
It looks like you are doing everything right. I think something has shifted and my trick for reproducing is not working now.

I thought perhaps we would have better luck with a different test job, but that logging patch generates so much output that almost all the tests fail and create massive log files. I trimmed it down with more hackery in https://treeherder.mozilla.org/#/jobs?repo=try&revision=46f757fb8a70752db14cad81718db384b610cc17, but I'm still not having any luck reproducing the failure.

The failure is still happening, according to Orange Factor, but reproducing it on try seems difficult. If you have a possible solution, I'd suggest landing it and seeing if the failures go away. Otherwise, we may just need to wait and see if we can reproduce it again in the future. (Or roll back and push a revision from a couple of weeks ago?)
Flags: needinfo?(gbrown)
Oddly, this has greatly reduced in frequency since April 11th.
Whiteboard: [stockwell needswork] → [stockwell unknown]
See Also: → 1352671
This is failing frequently again. I pushed the logging patch that :keeler and I were using earlier: https://treeherder.mozilla.org/#/jobs?repo=try&revision=ab246e341381addda1437dd67039e0506d5fadde. Good news: It is now easy to reproduce this on linux32-debug. Bad news: The hang seems to have moved.

I'll try to track it down again.
Flags: needinfo?(gbrown)
I am seeing a few different causes of this failure. Let's look at one...

https://treeherder.mozilla.org/#/jobs?repo=try&revision=c8d621a1f21ba3c5f6c63e3595c2c0c0d9b2e028

https://public-artifacts.taskcluster.net/CH-cN9NLRpqunwGTEKcUxQ/0/public/logs/live_backing.log

 20:39:51     INFO - TEST-INFO | started process GECKO(1082)
 20:39:51     INFO - GECKO(1082) | ### XPCOM_MEM_BLOAT_LOG defined -- logging bloat/leaks to /tmp/tmpxcE5GX.mozrunner/runtests_leaks.log
 20:41:44     INFO - GECKO(1082) | XRE_mainRun inits commandline...
...
 20:41:45     INFO - GECKO(1082) | 1494362505551	Marionette	INFO	marionette enabled by command line flag
...
 20:41:49     INFO - GECKO(1082) | ++DOMWINDOW == 12 (0xde3e3c00) [pid = 1082] [serial = 12] [outer = 0xde4b8c00]
 20:41:51     INFO -  Traceback (most recent call last):
...
 20:41:51     INFO -  error: [Errno 111] Connection refused
 20:41:51    ERROR - Automation Error: Received unexpected exception while running application

It seems the -marionette flag has been seen and Marionette enabled -- https://hg.mozilla.org/try/rev/af523ebe17152d836b87244f56a79d81aed0337a#l1.12 -- but not initialized -- https://hg.mozilla.org/try/rev/af523ebe17152d836b87244f56a79d81aed0337a#l1.28.

There was activity just before the connection error, at 20:41:49...I wonder if we just hadn't gotten around to initializing MarionetteComponent. It looks like the connection error happens 120 seconds after process launch; could that time period be lengthened, and might that avoid some of these failures, especially on slow platforms like linux32-debug?

:whimboo -- Thoughts?
Flags: needinfo?(gbrown) → needinfo?(hskupin)
Hm, we wait for the "sessionstore-windows-restored" observer topic to be fired before we do the initialization. Maybe this one is coming through too late, or maybe never?

https://dxr.mozilla.org/mozilla-central/rev/b21b974d60d3075ae24f6fb1bae75d0f122f28fc/testing/marionette/components/marionette.js#197

On Windows we also wait until the GFX sanity window has been closed.
Flags: needinfo?(hskupin)
Thanks - I was wondering how that worked.

Reviewing some of the successful jobs in https://treeherder.mozilla.org/#/jobs?repo=try&revision=c8d621a1f21ba3c5f6c63e3595c2c0c0d9b2e028, I see more evidence that it can take ~2 minutes to init marionette:

https://public-artifacts.taskcluster.net/FmPpTwH8SOS9xFVUKNGhMg/0/public/logs/live_backing.log

20:42:26     INFO - TEST-INFO | started process GECKO(1096)
20:44:18     INFO - GECKO(1096) | 1494362658486	Marionette	INFO	Marionette init
20:44:18     INFO - GECKO(1096) | 1494362658960	Marionette	INFO	Listening on port 2828

https://public-artifacts.taskcluster.net/SPeyE3TqRXq9EAUhcBxEaQ/0/public/logs/live_backing.log
20:40:44     INFO - TEST-INFO | started process GECKO(1088)
20:42:24     INFO - GECKO(1088) | 1494362544274	Marionette	INFO	Marionette init
20:42:24     INFO - GECKO(1088) | 1494362544695	Marionette	INFO	Listening on port 2828

It can take less time - between 60 and 90 seconds seems common - but it isn't hard to find cases close to 120 seconds.
Geoff, would you mind pushing a try run with the following changes:
https://hg.mozilla.org/try/rev/d53acf55de513e4eaa6915877d9556e9d896bd3d

I also use that for another bug, and the output is pretty helpful. It will show us which observer notifications Marionette actually receives, and it should give us an indication of where in the initialization code we are waiting.
I also tried increasing DEFAULT_STARTUP_TIMEOUT from 120 seconds to 180 seconds: https://treeherder.mozilla.org/#/jobs?repo=try&revision=300dc393115b1973cf7c28ee522c45dd1dec11d0. With this change, I could not reproduce this failure.
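For context, a minimal sketch of the client-side knob being adjusted here, mirroring the start_session(timeout=...) call visible in the tracebacks above (the host, port, and 180-second value are only illustrative):

    from marionette_driver.marionette import Marionette

    client = Marionette(host="127.0.0.1", port=2828)
    # Allow slow platforms such as linux32-debug more than the default
    # 120 seconds to bring up the Marionette server before the connect
    # attempt fails with "Connection refused".
    client.start_session(timeout=180)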
So here is the timing for a failing test:

[task 2017-05-10T15:08:46.572745Z] 15:08:46     INFO - TEST-INFO | started process GECKO(1085)
[..]
[task 2017-05-10T15:09:54.628136Z] 15:09:54     INFO - GECKO(1085) | 1494428994626	Marionette	INFO	Received observer notification "sessionstore-windows-restored"
[task 2017-05-10T15:09:55.037043Z] 15:09:55     INFO - GECKO(1085) | 1494428995035	Marionette	INFO	Listening on port 2828

That means it took 1:10 min to start Marionette for this debug build. This is insanely long!

The Linux workers are using Ubuntu 16.04.2 LTS, so bug 1345105 shouldn't have an effect here.

Geoff, I would propose that you request a one-click loaner, run the setup task, and then run the tests manually while inspecting the CPU load while Firefox is starting up. I could imagine there is some other process that takes a fair amount of CPU power and slows Firefox down.
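A small sketch of how the CPU load could be sampled on such a loaner while Firefox starts, using psutil (which the harness already uses for process management); the helper and its arguments are hypothetical:

    import psutil

    def sample_cpu_during_startup(firefox_pid, duration=120, interval=1.0):
        proc = psutil.Process(firefox_pid)
        for _ in range(int(duration / interval)):
            # Per-process and system-wide CPU percentages over each interval.
            print("firefox: %5.1f%%  system: %5.1f%%"
                  % (proc.cpu_percent(interval=interval), psutil.cpu_percent()))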
(In reply to Henrik Skupin (:whimboo) from comment #111)
> That means it took 1:10 min to start Marionette for this debug build. This
> is insanely long!

It is a long time from a user perspective, but this is linux32 and debug and some cheap aws machine -- I don't feel qualified to criticize! :)

It looks to me like startup in this environment is usually between 60 and 120 seconds, with something like 3% of jobs taking more than 120 seconds (causing failures here).

> Geoff, I would propose that you request a one-click loaner, run the setup
> task, and then run the tests manually while inspecting the CPU load while
> Firefox is starting up. I could imagine there is some other process that
> takes a fair amount of CPU power and slows Firefox down.

Good idea. I did that, and I noted startup taking 60+ seconds, as expected. top indicated firefox-bin taking 95% CPU and python taking 4% CPU during startup; I didn't see any other significant CPU use during startup.
Ok, in that case we have two options... 1) invest in faster workers, or 2) bump up the startup timeout. I will leave it up to you to decide. I only wonder whether we actually save money by using such slow workers, which take much, much longer to run a test job, compared to faster workers with shorter job times.
Depends on: 1364228
Assignee: nobody → gbrown
I am trying to address this in bug 1364228, and am hopeful that changes there will reduce failure frequency by 80%+.
Whiteboard: [stockwell unknown] → [stockwell needswork]
(In reply to Geoff Brown [:gbrown] from comment #118)
> I am trying to address this in bug 1364228, and am hopeful that changes
> there will reduce failure frequency by 80%+.

Nearly all (but not quite all) linux failures have been eliminated.
Assignee: gbrown → nobody
Whiteboard: [stockwell needswork] → [stockwell fixed]
this picked up on the 17th, primarily on osx opt in mochitest*;  :gbrown, can you maybe help narrow this down or do your magic to make this actionable?
Flags: needinfo?(gbrown)
Whiteboard: [stockwell fixed] → [stockwell needswork]
(In reply to Joel Maher ( :jmaher) from comment #128)
> this picked up on the 17th, primarily on osx opt in mochitest*;  :gbrown,
> can you maybe help narrow this down or do your magic to make this actionable?

Maybe a side-effect from moving from Buildbot to Taskcluster?
Looking at the osx-10-10 failures only, I see - 
 - opt only, non-e10s only
 - various mochitests...perhaps biased to mochitest-chrome, mochitest-bc?
 - fails after 2 minutes of trying to connect; successful runs generally connect in a few seconds
 - earliest that I see is https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=00fa4f730bf6b7779a8fb78aaef3e23322f5750a&filter-searchStr=osx, but that certainly didn't cause this.
I'm debugging on try...may take a while.

Also, note whimboo's comments in bug 1374762.
In the failed cases, I find profile-after-change is seen but sessionstore-windows-restored is not.
(In reply to Geoff Brown [:gbrown] from comment #134)
> In the failed cases, I find profile-after-change is seen but
> sessionstore-windows-restored is not.

Yeah, that would indicate that we might have started Firefox in safe mode. The application command doesn't show it:

17:37:54     INFO - Application command: /Users/cltbld/tasks/task_1498003975/build/application/Nightly.app/Contents/MacOS/firefox -marionette -foreground -profile /var/folders/p5/c6r39pb97p5___0gvb4wyhzr00000w/T/tmpPpotiL.mozrunner

But I wonder which code paths could actually trigger the safe mode to become active. Geoff, have you had a chance to see it on a one click loaner? If we can see it with VNC connected that would be great.
(In reply to Henrik Skupin (:whimboo) from comment #135)
> active. Geoff, have you had a chance to see it on a one click loaner? If we
> can see it with VNC connected that would be great.

I have not seen it on a one click loaner. I have only reproduced it with try pushes like https://treeherder.mozilla.org/#/jobs?repo=try&revision=8e781e0c5b0bf42723b7a8f2cd14062916f3b309.

The latest result indicates sessionstore-windows-restored is not received because browser.js's onLoad() is not called:

https://dxr.mozilla.org/mozilla-central/rev/92eb911c35da48907d326604c4c92cf55e551895/browser/base/content/browser.js#1255

and therefore doesn't notify sessionstore-windows-restored:

https://dxr.mozilla.org/mozilla-central/rev/92eb911c35da48907d326604c4c92cf55e551895/browser/base/content/browser.js#1313
(In reply to Geoff Brown [:gbrown] from comment #137)
> I have not seen it on a one click loaner. I have only reproduced it with try
> pushes like
> https://treeherder.mozilla.org/#/
> jobs?repo=try&revision=8e781e0c5b0bf42723b7a8f2cd14062916f3b309.

bc2 and bc3 are failing way more than other tests, so having a look at the VNC screen while running such a test multiple times should give us more information. It would be good to know whether the browser window comes up at all, or e.g. whether safe mode is active.
I don't think there is a VNC screen to access; these tests all run in AWS, and as gbrown mentioned, he hasn't seen this on a one-click loaner.
Every one-click loaner, at least for Linux workers, also allows you to connect to the screen via VNC. If he hasn't seen it yet, he might not have run enough test jobs to trigger the behavior. It's all the same environment, so I do not understand why a one-click loaner should not be helpful here.
:whimboo, a one-click loaner is not always a realistic way to reproduce problems; many times I have had to give up and go to the try server. Have you seen this failure while inside a one-click loaner? Maybe you have tips for how to reproduce it.
What is the try server? If you mean try builds, then those jobs also allow a one-click loaner. Again, tests on those run totally the same way. I really don't understand the problem here, and we may want to discuss this outside of this bug.
I just haven't used a one-click loaner for this because it seemed inconvenient: With a try push I can start 100+ mochitest jobs and almost always reproduce this problem a couple of times. I don't fancy sitting on a one-click loaner re-running mochitest 100 times. I'll give it a shot with bc2/bc3...see if I can catch it.
...or not!

It looks like one-click loaners are not available for osx (the one-click loaner option can be selected in Treeherder, but it is disabled in Taskcluster).

This happens too infrequently on linux to try to reproduce there.
Possibly getting a physical loaner?

This could be a side effect of running tests in Taskcluster vs. Buildbot; the timing looks about right.
There have been no new failures on any platform since June 24 and none on OSX since June 23. That's a weekend, with fewer pushes than normal, but a try push seems to confirm: This isn't happening any more.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=eb4128adf5fa4d197c868e2f33d5039c18d1eb99

My planned next step here was to try to track down why browser.js' onLoad() was not being called, even though xpcom startup seemed fine. I also wanted to check if this might be related to safe mode, but didn't have a concrete idea of how to pursue either idea.

Hoping this stays fixed...
Flags: needinfo?(gbrown)
My patch for bug 1374762 landed 2 days ago, and since then we haven't had a single failure. So I really wonder if this could have been due to an inappropriate start in safe mode.

Please note that with my patch I added a DEBUG log line to Marionette which prints out when the safe mode dialog gets handled. Sadly, by default this will not show up in the logs of other test suites, because the preference `marionette.log.level = 'trace'` is not used. Maybe it should be added, so that Marionette prints a bit more output in case of failures.

Geoff, what do you think?
Flags: needinfo?(gbrown)
There have been a couple of failures today. :(

...But there has been a lot of improvement here: most of it around June 24, and perhaps some more around the time of your safe mode change. That's great!


I find investigating and debugging these "Connection refused" failures tiresome and welcome additional logging that would help narrow down the cause of the failure more quickly. Looking at the current Marionette test logs, I'm not convinced the TRACE messages would help much: it looks like "Listening on port 2828" is still the first message, and I don't see much more logging at startup. I'm also concerned by all the TRACE messages while tests are running: our test logs are already so big and convoluted, I don't want to see much more added unless we really need it. Maybe it's best to keep it the way it is? It's easy to add TRACE logs in try pushes.
Flags: needinfo?(gbrown)
Geoff, it should actually be enough to use `marionette.log.level = 'debug'`, because that is what we use to print out other information during startup and page loads. The latter shouldn't matter here, because it would only happen for page load requests made via Marionette, and that's nothing those test harnesses are doing.

Otherwise we could add an INFO-level line for how Marionette has been enabled (command line flag or environment variable). It would indicate that Marionette is enabled, but not necessarily that the server has started.
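For illustration only, a minimal sketch (my addition, not code from this bug or from any harness) of how a harness that builds its test profile with mozprofile could set that preference. The pref name comes from this comment; the mozprofile usage and values are assumptions:

# Hedged sketch: raise Marionette's log level via the profile preference
# mentioned above, assuming the harness builds its profile with mozprofile.
from mozprofile import Profile

profile = Profile(preferences={"marionette.log.level": "debug"})
print(profile.profile)  # path of the generated profile directory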
(In reply to Henrik Skupin (:whimboo) from comment #152)
> startup and page loads. The latter shouldn't matter here, because it would
> only happen for page load requests made via Marionette, and that's nothing
> those test harnesses are doing.

Oh, I see - I hadn't thought of that. That sounds okay then.

> Otherwise we could add an INFO-level line for how Marionette has been
> enabled (command line flag or environment variable). It would indicate that
> Marionette is enabled, but not necessarily that the server has started.

That sounds good too!
Depends on: 1378209
(In reply to Geoff Brown [:gbrown] from comment #153)
> > startup and page loads. The latter shouldn't matter here, because it would
> > only happen for page load requests made via Marionette, and that's nothing
> > those test harnesses are doing.
> 
> Oh, I see - I hadn't thought of that. That sounds okay then.

I would leave this up to you then given that this is outside of Marionette.

> > Otherwise we could add an INFO-level line for how Marionette has been
> > enabled (command line flag or environment variable). It would indicate that
> > Marionette is enabled, but not necessarily that the server has started.
> 
> That sounds good too!

I filed bug 1378209 to get this implemented soon.
So a normal application start looks like this:

[task 2017-07-09T10:27:16.225590Z] 10:27:16     INFO - Application command: /home/worker/workspace/build/application/firefox/firefox -marionette -foreground -profile /tmp/tmpABZapw.mozrunner
[task 2017-07-09T10:27:16.246287Z] 10:27:16     INFO - runtests.py | Application pid: 1813
[task 2017-07-09T10:27:16.246599Z] 10:27:16     INFO - TEST-INFO | started process GECKO(1813)
[task 2017-07-09T10:27:18.870075Z] 10:27:18     INFO - GECKO(1813) | 1499596038867	Marionette	INFO	Enabled via --marionette
[task 2017-07-09T10:27:23.520937Z] 10:27:23     INFO - GECKO(1813) | 1499596043514	Marionette	INFO	Listening on port 2828

As you can see Marionette is enabled via the command line flag. But here is the output from a non-working case:

[task 2017-07-09T10:30:16.098939Z] 10:30:16     INFO - Application command: /home/worker/workspace/build/application/firefox/firefox -marionette -foreground -profile /tmp/tmpoguz_J.mozrunner
[task 2017-07-09T10:30:16.114279Z] 10:30:16     INFO - runtests.py | Application pid: 2431
[task 2017-07-09T10:30:16.114769Z] 10:30:16     INFO - TEST-INFO | started process GECKO(2431)
[task 2017-07-09T10:33:16.192469Z] 10:33:16     INFO -  Traceback (most recent call last):
[task 2017-07-09T10:33:16.195312Z] 10:33:16     INFO -    File "/home/worker/workspace/build/tests/mochitest/runtests.py", line 2534, in doTests
[task 2017-07-09T10:33:16.197115Z] 10:33:16     INFO -      marionette_args=marionette_args,
[task 2017-07-09T10:33:16.199086Z] 10:33:16     INFO -    File "/home/worker/workspace/build/tests/mochitest/runtests.py", line 2135, in runApp
[task 2017-07-09T10:33:16.200887Z] 10:33:16     INFO -      self.marionette.start_session(timeout=port_timeout)
[task 2017-07-09T10:33:16.202293Z] 10:33:16     INFO -    File "/home/worker/workspace/build/venv/local/lib/python2.7/site-packages/marionette_driver/decorators.py", line 28, in _
[task 2017-07-09T10:33:16.203538Z] 10:33:16     INFO -      m._handle_socket_failure()
[task 2017-07-09T10:33:16.205368Z] 10:33:16     INFO -    File "/home/worker/workspace/build/venv/local/lib/python2.7/site-packages/marionette_driver/decorators.py", line 23, in _
[task 2017-07-09T10:33:16.207145Z] 10:33:16     INFO -      return func(*args, **kwargs)
[task 2017-07-09T10:33:16.208370Z] 10:33:16     INFO -    File "/home/worker/workspace/build/venv/local/lib/python2.7/site-packages/marionette_driver/marionette.py", line 1208, in start_session
[task 2017-07-09T10:33:16.210069Z] 10:33:16     INFO -      self.protocol, _ = self.client.connect()
[task 2017-07-09T10:33:16.211347Z] 10:33:16     INFO -    File "/home/worker/workspace/build/venv/local/lib/python2.7/site-packages/marionette_driver/transport.py", line 223, in connect
[task 2017-07-09T10:33:16.212672Z] 10:33:16     INFO -      self.sock.connect((self.addr, self.port))
[task 2017-07-09T10:33:16.213892Z] 10:33:16     INFO -    File "/usr/lib/python2.7/socket.py", line 228, in meth
[task 2017-07-09T10:33:16.215644Z] 10:33:16     INFO -      return getattr(self._sock,name)(*args)
[task 2017-07-09T10:33:16.216847Z] 10:33:16     INFO -  error: [Errno 111] Connection refused
[task 2017-07-09T10:33:16.218212Z] 10:33:16    ERROR - Automation Error: Received unexpected exception while running application

The Marionette component didn't get started, not even after 3 minutes! As such the client times out. This indicates that there is clearly a problem in Firefox, whether with initializing components or with a hang/crash during startup.
Whiteboard: [stockwell needswork] → [stockwell unknown]
Depends on: 1382162
Depends on: 1385281
Whiteboard: [stockwell unknown] → [stockwell needswork]
It appears this bug has become a catch-all... there are totally unrelated classifications against this bug.
See Also: → 1385822
See Also: → 1385847
It was my impression that mochitests would soon move to installing
addons in a different way and not use Marionette?  Has this plan
changed, or am I mistaken?

From looking at a recent log:

> 16:35:06     INFO - GECKO(1097) | 1502728506714	Marionette	INFO	Enabled via --marionette

[…]

> 16:35:12     INFO -  error: [Errno 111] Connection refused
> 16:35:12    ERROR - Automation Error: Received unexpected exception while running application

In between here, we would have expected to see “Marionette INFO
Listening on port 2828”, indicating that the TCP listener in the
remote protocol has started up.

We use Services.tm.idleDispatchToMainThread to delay starting
up Marionette until Firefox has finished whatever it is
doing on the main thread to avoid potential synchronisation
problems.  You can see the relevant code for this in
https://searchfox.org/mozilla-central/rev/b258e6864ee3e809d40982bc5d0d5aff66a20780/testing/marionette/components/marionette.js#292-311.

I suspect that this check is not relevant to mochitest and that
mochitest may be instrumenting the content shell in such a way that
it never reaches an idle main thread. jmaher, does this sound like a
plausible theory?
Flags: needinfo?(jmaher)
Andreas, please see bug 1382162, which might be the underlying issue here too. It is marked as a dependency.
:ato, the work to not use Marionette was for Talos only, not mochitest/reftest; that work has been done, and mozprofile installs the webextensions. Talos is different from mochitest/reftest in that it has signed addons/webextensions and we can have persistent installations; if we were to sign our addons for the unittests, then we could install them in a different way.
Flags: needinfo?(jmaher)
:automatedtester, can you find someone to look at fixing this? We have many failures per week showing up from this, and Marionette connections have been a persistent problem since we started using Marionette.
Flags: needinfo?(dburns)
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #173)
> :automatedtester, can you find someone to look at fixing this? We have many
> failures per week showing up from this, and Marionette connections have
> been a persistent problem since we started using Marionette.

Once again, all those failures should be related to bug 1382162, which needs to be investigated and fixed first. Geoff has spent the most time on this so far but has had a very hard time reproducing it. I would suggest we move any discussion over to bug 1382162.
I agree, but we need someone to own resolving this issue as that bug has been idle for 10 days.
Comment on attachment 8902637 [details]
Bug 1261598 - Set port timeout to be more than 60s for ASAN.

https://reviewboard.mozilla.org/r/174308/#review179476

::: testing/mozharness/configs/unittests/linux_unittest.py:119
(Diff revision 1)
>                  "--log-errorsummary=%(error_summary_file)s",
>                  "--use-test-media-devices",
>                  "--screenshot-on-fail",
>                  "--cleanup-crashes",
>                  "--marionette-startup-timeout=180",
> +                "--marionette-port-timeout=180"

The port timeout is by default 360s. Why should we reduce it here, so that its value is even clashing with the --marionette-startup-timeout?

I don't see that this will change anything. It only will abort tests earlier.
(In reply to Henrik Skupin (:whimboo) from comment #177)
> Comment on attachment 8902637 [details]
> Bug 1261598 - Set port timeout to be more than 60s for ASAN.
> 
> https://reviewboard.mozilla.org/r/174308/#review179476
> 
> ::: testing/mozharness/configs/unittests/linux_unittest.py:119
> (Diff revision 1)
> >                  "--log-errorsummary=%(error_summary_file)s",
> >                  "--use-test-media-devices",
> >                  "--screenshot-on-fail",
> >                  "--cleanup-crashes",
> >                  "--marionette-startup-timeout=180",
> > +                "--marionette-port-timeout=180"
> 
> The port timeout is by default 360s. Why should we reduce it here, so that
> its value is even clashing with the --marionette-startup-timeout?
> 
> I don't see that this will change anything. It only will abort tests earlier.

The tests are timing out after 60s, so it's clearly not waiting 360s.
Flags: needinfo?(dburns)
(In reply to David Burns :automatedtester from comment #178)
> > ::: testing/mozharness/configs/unittests/linux_unittest.py:119
> > (Diff revision 1)
> > >                  "--log-errorsummary=%(error_summary_file)s",
> > >                  "--use-test-media-devices",
> > >                  "--screenshot-on-fail",
> > >                  "--cleanup-crashes",
> > >                  "--marionette-startup-timeout=180",
> > > +                "--marionette-port-timeout=180"
> > 
> > I don't see that this will change anything. It only will abort tests earlier.
> 
> The tests are timing out after 60s so its clearly not waiting 360s

No, tests are timing out after 3 min (180s), which is the startup timeout as listed above:

> [task 2017-08-26T08:04:23.069762Z] 08:04:23     INFO - Application command: /home/worker/workspace/build/application/firefox/firefox -marionette -foreground -profile /tmp/tmpGylzXD.mozrunner
[..]
[task 2017-08-26T08:04:27.051735Z] 08:04:27     INFO - GECKO(1205) | 1503734667044	Marionette	INFO	Enabled via --marionette
[task 2017-08-26T08:07:29.551916Z] 08:07:29     INFO -  Traceback (most recent call last):

To get more details about this failure we could temporarily do the following:

1) Change Marionette's log lines for received observer notifications from DEBUG to INFO level
2) Update test harnesses which use Marionette to use the DEBUG log level instead of INFO

I would actually prefer option 2), which could simply be done via a preference, as already mentioned in comment 150.

But basically it's still bug 1382162 we have to work on, because that is the underlying issue here.
Depends on: 1397734
Please note that some failures like the following are not caused by Marionette; instead we hit a MOZ_CRASH, which causes a shutdown of Firefox.

https://treeherder.mozilla.org/logviewer.html#?repo=autoland&job_id=127986873&lineNumber=22362

[task 2017-09-02T11:53:03.283917Z] 11:53:03     INFO - GECKO(2785) | Hit MOZ_CRASH(Aborting on channel error.) at /builds/worker/workspace/build/src/ipc/glue/MessageChannel.cpp:2535
[task 2017-09-02T11:53:03.290153Z] 11:53:03     INFO - GECKO(2785) | #01: mozilla::ipc::ProcessLink::OnChannelError [ipc/glue/MessageLink.cpp:393]
[task 2017-09-02T11:53:03.290404Z] 11:53:03     INFO - 
[task 2017-09-02T11:53:03.293195Z] 11:53:03     INFO - GECKO(2785) | #02: event_process_active_single_queue [ipc/chromium/src/third_party/libevent/event.c:1648]

Shouldn't we also add MOZ_CRASH to the error list, so that those cases are getting reported correctly on Treeherder? Otherwise we will get falsely starred test failures like in the above case. Ed, can you give feedback?
Flags: needinfo?(emorley)
So what I think also goes wrong here is how other test harnesses use Marionette. I already reported that in bug 1393366 for wpt tests. What's basically happening here is the following:

By default the startup_timeout of the Marionette driver is 120s, BUT as we can see here, it errors out after only 60s. The 120s is only used when you call `wait_for_port` or `raise_for_port` before the first call to `start_session`. Starting a session defaults to 60s and uses this value, inappropriately, to wait for the server to be ready; instead it would have to use the 120s:

http://searchfox.org/mozilla-central/rev/44c693914255638d74bcf1ec3b0bcd448dfab2fd/testing/marionette/client/marionette_driver/marionette.py#1173-1202

I will take care of that in my patch on bug 1397675. This should definitely give us 60s more time, which should be sufficient.
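To illustrate the distinction, a hedged sketch against the marionette_driver client described above (host, port, and the exact call pattern are assumptions on my part, not harness code):

# Hedged sketch: wait for the Marionette port using the longer
# startup_timeout before starting a session, instead of relying on
# start_session()'s shorter default wait.
from marionette_driver.marionette import Marionette

client = Marionette(host="127.0.0.1", port=2828, startup_timeout=120)
client.raise_for_port()   # blocks up to startup_timeout for the server socket
client.start_session()    # port is already open, so this connects quickly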
Depends on: 1397675
(In reply to Henrik Skupin (:whimboo) from comment #181)
> Shouldn't we also add MOZ_CRASH to the error list, so that those cases are
> getting reported correctly on Treeherder? Otherwise we will get falsely
> starred test failures like in the above case. Ed, can you give feedback?

I agree this sounds useful to have marked as an error (presuming it would result in no false positives). However, we're trying to move away from hardcoding error strings in Treeherder, so I would prefer this to be addressed by making the mozharness log level prefix ERROR rather than the current INFO, e.g.:

[task 2017-09-02T11:53:03.283917Z] 11:53:03     ERROR - GECKO(2785) | Hit MOZ_CRASH(...
Flags: needinfo?(emorley)
(In reply to Ed Morley [:emorley] from comment #185)
> hardcoding error strings in Treeherder, so I would prefer this to be
> addressed by making the mozharness log level prefix be ERROR rather than the

Sure, I filed bug 1399993.
The patch on bug 1397675 landed yesterday. I will observe the results on Monday to see if it also fixes this problem.
The number of failures dropped a lot after bug 1397675 got fixed, but there are still issues, and I think this is because of bug 1382162 (missing session restore notification).
I don't see many failures in the last day; in fact it has been quiet for almost 16 hours now. Let's see if this trend continues through the end of the week :)
Well, there are still quite a lot of failures. And with each job now waiting 3 minutes for Firefox and Marionette to be ready, nothing seems to have changed. This clearly indicates that the build is stuck during startup, and Marionette doesn't start because of it. Note that even the command line arguments handler is not getting called!

Most of the test failures are for mochitest-chrome on Windows 10, and similar to bug 1352671 they only happen for jobs run on Buildbot. Given that Joel moved those to TaskCluster via bug 1401184, let's see if this changes something.

If not, I'm still working on improving the logging of the Marionette driver.
Depends on: 1401184
The remaining failures mainly happen for browser-chrome tests, which Joel is trying to move to TC now. Maybe that helps us.
Yesterday I can only see failures for browser-chrome jobs which run via Buildbot; there is not a single entry for those running via TC now. So moving them from hardware to VMs seems to have helped a lot, and that was done via bug 1402068.

The top remaining failing jobs seem to be mochitest-gl-e10s. Joel, is there anything we can do to also port them to VMs?
Depends on: 1402068
Flags: needinfo?(jmaher)
For webgl/gpu/reftest we need a working GPU instance with a supported video card; right now there are issues with that, which is why these are running on hardware. That work is tracked in bug 1366288.
Flags: needinfo?(jmaher)
There is a chance this could still come up on the radar for disable-recommended, but we are under the 200 threshold and I suspect we will remain there.
Whiteboard: [stockwell disable-recommended] → [stockwell needswork]
sorry, I was looking at bug 1391545 and accidentally changed this bug.
Whiteboard: [stockwell needswork] → [stockwell disable-recommended]
(In reply to OrangeFactor Robot from comment #202)
> 40 failures in 178 pushes (0.225 failures/push) were associated with this
> bug yesterday.    

Only 5 of those failures are from test jobs run on TC workers; all the rest come from Buildbot workers. The mochitest-gl-e10s-* jobs seem to be the most affected.
Depends on: 1403484
Depends on: 1403616
There have been no more failures since September 28th, which is when my patch on bug 1403616 landed.
Assignee: nobody → hskupin
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla58
Whiteboard: [stockwell disable-recommended] → [stockwell fixed:other]
Product: Testing → Remote Protocol
Moving bug to Testing::Marionette Client and Harness component per bug 1815831.
Component: Marionette → Marionette Client and Harness
Product: Remote Protocol → Testing