Closed Bug 1261598 Opened 8 years ago Closed 7 years ago

Intermittent Automation Error: Received unexpected exception while running application | after error: [Errno 111] Connection refused

Categories

(Testing :: Marionette Client and Harness, defect)

Priority: Not set
Severity: normal

Tracking

(firefox57 fixed, firefox58 fixed)

Status: RESOLVED FIXED
Target Milestone: mozilla58

People

(Reporter: aryx, Assigned: whimboo)

References

Details

(Keywords: intermittent-failure, Whiteboard: [stockwell fixed:other])

Attachments

(1 file)

https://treeherder.mozilla.org/logviewer.html#?job_id=25030028&repo=mozilla-inbound

00:20:46     INFO -  MochitestServer : launching [u'/home/worker/workspace/build/tests/bin/xpcshell', '-g', '/home/worker/workspace/build/application/firefox', '-v', '170', '-f', '/home/worker/workspace/build/tests/bin/components/httpd.js', '-e', "const _PROFILE_PATH = '/tmp/tmpYwE2t1.mozrunner'; const _SERVER_PORT = '8888'; const _SERVER_ADDR = '127.0.0.1'; const _TEST_PREFIX = undefined; const _DISPLAY_RESULTS = false;", '-f', '/home/worker/workspace/build/tests/mochitest/server.js']
00:20:46     INFO -  runtests.py | Server pid: 1145
00:20:46     INFO -  runtests.py | Websocket server pid: 1148
00:20:46     INFO -  runtests.py | SSL tunnel pid: 1152
00:20:47     INFO -  runtests.py | Running tests: start.
00:20:47     INFO -  runtests.py | Application pid: 1173
00:20:47     INFO -  TEST-INFO | started process Main app process
00:20:47     INFO -  ### XPCOM_MEM_BLOAT_LOG defined -- logging bloat/leaks to /tmp/tmpYwE2t1.mozrunner/runtests_leaks.log
00:20:58     INFO -  ++DOCSHELL 0x7f254ede3800 == 1 [pid = 1173] [id = 1]
00:20:58     INFO -  ++DOMWINDOW == 1 (0x7f254ede4000) [pid = 1173] [serial = 1] [outer = (nil)]
00:20:58     INFO -  [1173] WARNING: No inner window available!: file /home/worker/workspace/build/src/dom/base/nsGlobalWindow.cpp, line 9773
00:20:58     INFO -  ++DOMWINDOW == 2 (0x7f254ede5000) [pid = 1173] [serial = 2] [outer = 0x7f254ede4000]
00:21:48     INFO -  Traceback (most recent call last):
00:21:48     INFO -    File "/home/worker/workspace/build/tests/mochitest/runtests.py", line 2357, in doTests
00:21:48     INFO -      marionette_args=marionette_args,
00:21:48     INFO -    File "/home/worker/workspace/build/tests/mochitest/runtests.py", line 1984, in runApp
00:21:48     INFO -      self.marionette.start_session(timeout=port_timeout)
00:21:48     INFO -    File "/home/worker/workspace/build/venv/local/lib/python2.7/site-packages/marionette_driver/marionette.py", line 1166, in start_session
00:21:48     INFO -      self.protocol, _ = self.client.connect()
00:21:48     INFO -    File "/home/worker/workspace/build/venv/local/lib/python2.7/site-packages/marionette_driver/transport.py", line 226, in connect
00:21:48     INFO -      self.sock.connect((self.addr, self.port))
00:21:48     INFO -    File "/usr/lib/python2.7/socket.py", line 224, in meth
00:21:49     INFO -      return getattr(self._sock,name)(*args)
00:21:49     INFO -  error: [Errno 111] Connection refused
00:21:49     INFO -  0 ERROR Automation Error: Received unexpected exception while running application
Depends on: 1293093
There's been a significant spike in frequency today, and higher-than-normal frequency for this bug over the last couple of days. It looks like many of the recent failures are on linux64-asan, in the gpu or mochitest-gl suites.

:whimboo -- Do you know what's going on?
Flags: needinfo?(hskupin)
One pattern I see is that this is mochitest-only, right? And every time it happens, we run some tests and then the Mochitest server seems to start a new Firefox process:

https://treeherder.mozilla.org/logviewer.html#?repo=autoland&job_id=74110423&lineNumber=1840

I don't know those suites well enough to give a clear answer, but maybe this is a shutdown/startup issue? It would be great to see a test job with gecko logging turned on (if possible), if the failure rate stays that high.
Flags: needinfo?(hskupin)
See Also: → 1318951
:gbrown is this something you could look at in the next week or two?
Flags: needinfo?(gbrown)
Note: this pattern changed around Feb 1st, and our rate of hitting this bug increased.
Isn't that bug 1318951, which Geoff added a day ago under "See Also"?
I am not clear whether that bug is really mis-starred instances of this bug or vice versa. I am under the impression that this bug is different from the other one and needs to be fixed by someone on our team.
Will see what I can do...
Assignee: nobody → gbrown
Flags: needinfo?(gbrown)
(In reply to Henrik Skupin (:whimboo) from comment #11)
> I don't know those suites well enough to give a clear answer, but maybe this
> is a shutdown/startup issue? It would be great to see a test job with gecko
> logging turned on (if possible), if the failure rate stays that high.

I can reproduce on try fairly easily. How do I turn on gecko logging?
Flags: needinfo?(hskupin)
(In reply to Geoff Brown [:gbrown] from comment #18)
> (In reply to Henrik Skupin (:whimboo) from comment #11)
> > I don't know those suites well enough to give a clear answer, but maybe
> > this is a shutdown/startup issue? It would be great to see a test job with
> > gecko logging turned on (if possible), if the failure rate stays that high.
> 
> I can reproduce on try fairly easily. How do I turn on gecko logging?

With the Marionette harness itself it's just "--gecko-log", but I don't know how the other test harnesses do that. Maybe Joel knows?
Flags: needinfo?(hskupin) → needinfo?(jmaher)
I have no idea; gbrown had asked me before asking in the bug.
Flags: needinfo?(jmaher)
Ok, so I had a look, and there is a "--console" option that can be passed when starting Firefox. If the test harness you are checking here isn't setting it, please hack it into the mozrunner run_command call.
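For illustration, a minimal sketch of what that could look like, assuming a harness that builds the Firefox command line as a Python list before handing it to mozrunner (the function and variable names are hypothetical):

    def build_app_command(binary, profile_dir, extra_args=None):
        cmd = [binary, "-marionette", "-foreground", "-profile", profile_dir]
        # Route Gecko's output to the console so it ends up in the test log.
        cmd.append("--console")
        if extra_args:
            cmd.extend(extra_args)
        return cmd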
Thanks :whimboo. 

https://treeherder.mozilla.org/#/jobs?repo=try&revision=97becc65975e0567282ab2fb941971ea0cf7132d has --console and reproduces this failure. For example:

https://public-artifacts.taskcluster.net/AxKCpVtaRFGg6Suwe2anEQ/0/public/logs/live_backing.log

[task 2017-02-10T23:38:04.684283Z] 23:38:04     INFO - Application command: /home/worker/workspace/build/application/firefox/firefox --console -marionette -foreground -profile /tmp/tmp1gV_zP.mozrunner
[task 2017-02-10T23:38:04.715424Z] 23:38:04     INFO - runtests.py | Application pid: 1128
[task 2017-02-10T23:38:04.716011Z] 23:38:04     INFO - TEST-INFO | started process Main app process
[task 2017-02-10T23:40:04.754794Z] 23:40:04     INFO -  Traceback (most recent call last):
[task 2017-02-10T23:40:04.755610Z] 23:40:04     INFO -    File "/home/worker/workspace/build/tests/mochitest/runtests.py", line 2417, in doTests
[task 2017-02-10T23:40:04.756719Z] 23:40:04     INFO -      marionette_args=marionette_args,
[task 2017-02-10T23:40:04.758099Z] 23:40:04     INFO -    File "/home/worker/workspace/build/tests/mochitest/runtests.py", line 2020, in runApp
[task 2017-02-10T23:40:04.759219Z] 23:40:04     INFO -      self.marionette.start_session(timeout=port_timeout)
[task 2017-02-10T23:40:04.760575Z] 23:40:04     INFO -    File "/home/worker/workspace/build/venv/local/lib/python2.7/site-packages/marionette_driver/decorators.py", line 28, in _
[task 2017-02-10T23:40:04.762437Z] 23:40:04     INFO -      m._handle_socket_failure()
[task 2017-02-10T23:40:04.763931Z] 23:40:04     INFO -    File "/home/worker/workspace/build/venv/local/lib/python2.7/site-packages/marionette_driver/decorators.py", line 23, in _
[task 2017-02-10T23:40:04.765374Z] 23:40:04     INFO -      return func(*args, **kwargs)
[task 2017-02-10T23:40:04.767229Z] 23:40:04     INFO -    File "/home/worker/workspace/build/venv/local/lib/python2.7/site-packages/marionette_driver/marionette.py", line 1284, in start_session
[task 2017-02-10T23:40:04.769433Z] 23:40:04     INFO -      self.protocol, _ = self.client.connect()
[task 2017-02-10T23:40:04.770252Z] 23:40:04     INFO -    File "/home/worker/workspace/build/venv/local/lib/python2.7/site-packages/marionette_driver/transport.py", line 223, in connect
[task 2017-02-10T23:40:04.770992Z] 23:40:04     INFO -      self.sock.connect((self.addr, self.port))
[task 2017-02-10T23:40:04.771704Z] 23:40:04     INFO -    File "/usr/lib/python2.7/socket.py", line 228, in meth
[task 2017-02-10T23:40:04.776997Z] 23:40:04     INFO -      return getattr(self._sock,name)(*args)
[task 2017-02-10T23:40:04.782340Z] 23:40:04     INFO -  error: [Errno 111] Connection refused
[task 2017-02-10T23:40:04.783482Z] 23:40:04    ERROR - Automation Error: Received unexpected exception while running application

I don't think that helped.

I also added gecko_log to the Marionette arguments, but that didn't seem to work.

I also defined MOZ_LOG with some common nspr logging components ("all" doesn't work currently)...that generated some additional info in the mozLogs.zip artifact, but didn't produce anything that jumps out at me as an explanation for what's going wrong here.
I suppose Firefox may be crashing on startup; I think that would explain the connection refused? I wonder: if runtests.py caught the failure here and continued, would it find the app process still running, or would it find a crash dump? I'll look into that next.
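A rough sketch of the check being proposed, assuming psutil and mozcrash are available to the harness (as they appear to be elsewhere in this bug); the helper name and arguments are hypothetical:

    import psutil
    import mozcrash

    def diagnose_connect_failure(app_pid, minidump_dir, symbols_path):
        # Is the Firefox process still alive, or did it die before Marionette came up?
        still_running = psutil.pid_exists(app_pid)
        # Report any minidumps left behind; check_for_crashes() returns a count
        # in recent mozcrash versions (a boolean in older ones).
        found_crashes = mozcrash.check_for_crashes(minidump_dir, symbols_path)
        return still_running, found_crashes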
Maybe other test harnesses besides Mn do not correctly handle exceptions thrown when trying to use Marionette for installing an add-on? Keep in mind that in such cases Marionette doesn't do any crash checking and the like; that is entirely up to the harness using Marionette.
Recent findings...

https://treeherder.mozilla.org/#/jobs?repo=try&revision=eea99c2033987e186e9d43fd49ff6311b44ad287
Fail: https://public-artifacts.taskcluster.net/SVI8qSOaTTClKpArCIrb5A/0/public/logs/live_backing.log
OK:   https://public-artifacts.taskcluster.net/KDi2NWBzTUWoxvVcCVVYoA/0/public/logs/live_backing.log

I don't think Firefox is crashing:

[task 2017-02-13T02:47:25.051571Z] 02:47:25    ERROR - Force-terminating active process(es).
[task 2017-02-13T02:47:25.052837Z] 02:47:25     INFO - Determining child pids from psutil
[task 2017-02-13T02:47:25.059342Z] 02:47:25     INFO - Found child pids: [1959, 1981]
[task 2017-02-13T02:47:25.060637Z] 02:47:25     INFO - Killing process: 1959
[task 2017-02-13T02:47:25.061522Z] 02:47:25     INFO - TEST-INFO | started process screentopng
[task 2017-02-13T02:47:25.581131Z] 02:47:25     INFO - TEST-INFO | screentopng: exit 0
[task 2017-02-13T02:47:25.585511Z] 02:47:25     INFO - Can't trigger Breakpad, just killing process
[task 2017-02-13T02:47:25.586586Z] 02:47:25     INFO - Killing process: 1981
[task 2017-02-13T02:47:25.588030Z] 02:47:25     INFO - Not taking screenshot here: see the one that was previously logged
[task 2017-02-13T02:47:25.589271Z] 02:47:25     INFO - Can't trigger Breakpad, just killing process
[task 2017-02-13T02:47:26.699131Z] 02:47:26     INFO - psutil found pid 1959 dead
[task 2017-02-13T02:47:26.700186Z] 02:47:26     INFO - psutil found pid 1981 dead
[task 2017-02-13T02:47:26.700899Z] 02:47:26     INFO - Killing process: 1955


"Marionette enabled via command-line flag" is missing from the failed cases:

https://dxr.mozilla.org/mozilla-central/rev/0eef1d5a39366059677c6d7944cfe8a97265a011/testing/marionette/components/marionette.js#111

but it should be there:

[task 2017-02-13T02:41:54.105081Z] 02:41:54     INFO - Application command: /home/worker/workspace/build/application/firefox/firefox --console -marionette -foreground -profile /tmp/tmpntlzpC.mozrunner
Startup is hanging at https://dxr.mozilla.org/mozilla-central/rev/0eef1d5a39366059677c6d7944cfe8a97265a011/toolkit/xre/nsXREDirProvider.cpp#1155:

    em->Observe(nullptr, "addons-startup", nullptr);

DoStartup() is called from https://dxr.mozilla.org/mozilla-central/rev/0eef1d5a39366059677c6d7944cfe8a97265a011/toolkit/xre/nsAppRunner.cpp#4289; since that doesn't progress, the command line handlers are never called and marionette is never started.

I'll check into why the addon manager startup appears to be hanging.
...and addon manager startup hangs due to a hang somewhere in CryptoTask.

Still making slow progress here, tracking down the source of the hang with logging and try pushes.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=bfb2e557d65cfb110b70f8b82eb259896a6ccdc1
:gbrown, any updates in this bug? I recall you mentioning some progress; it would be good to document it here.
Flags: needinfo?(gbrown)
Whiteboard: [stockwell needswork]
tl;dr - During automated test runs, browser startup sometimes hangs at https://dxr.mozilla.org/mozilla-central/rev/e1135c6fdc9bcd80d38f7285b269e030716dcb72/toolkit/mozapps/extensions/internal/XPIProvider.jsm#1596; the CryptoTask that it is waiting for was called off main thread, called NS_DispatchToMainThread(), but is not called again.


The details:
 - Marionette's connection is refused because the marionette server (in firefox) is not running.
 - Marionette is not running because it was never started; it was not started because firefox startup has hung before processing command line arguments like '-marionette'.
 - Startup is hung at https://dxr.mozilla.org/mozilla-central/rev/0eef1d5a39366059677c6d7944cfe8a97265a011/toolkit/xre/nsXREDirProvider.cpp#1155:

    em->Observe(nullptr, "addons-startup", nullptr);

 - Addons startup is hung at:

 syncLoadManifestFromFile@resource://gre/modules/addons/XPIProvider.jsm:1598:16
 processPendingFileChanges@resource://gre/modules/addons/XPIProvider.jsm:3408:19
 checkForChanges@resource://gre/modules/addons/XPIProvider.jsm:3698:19
 startup@resource://gre/modules/addons/XPIProvider.jsm:2802:25
 callProvider@resource://gre/modules/AddonManager.jsm:272:12
 _startProvider@resource://gre/modules/AddonManager.jsm:756:5
 startup@resource://gre/modules/AddonManager.jsm:939:9
 startup@resource://gre/modules/AddonManager.jsm:3136:5
 observe@resource://gre/components/addonManager.js:65:9

Sorry, those line numbers were modified by my logging in https://treeherder.mozilla.org/#/jobs?repo=try&revision=b1ecf27fcebc97a0c7d997a3e2f06631d5c24786...so, addons startup is hung at:

https://dxr.mozilla.org/mozilla-central/rev/e1135c6fdc9bcd80d38f7285b269e030716dcb72/toolkit/mozapps/extensions/internal/XPIProvider.jsm#1596

  while (success === undefined)
    thread.processNextEvent(true);

I can reproduce this failure on try, on linux64-asan. If I trigger a crash from the harness when the marionette connection times out, I consistently get a stack in nsAppShell::ProcessNextNativeEvent:

https://public-artifacts.taskcluster.net/VRVjeVOzSxuOGwAywnDkaA/0/public/logs/live_backing.log

#26 0x7f4954a044ab in g_main_context_iteration (/lib/x86_64-linux-gnu/libglib-2.0.so.0+0x4a4ab)
#27 0x7f4946de010e in nsAppShell::ProcessNextNativeEvent(bool) /home/worker/workspace/build/src/widget/gtk/nsAppShell.cpp:270:12
#28 0x7f4946d3d114 in DoProcessNextNativeEvent /home/worker/workspace/build/src/widget/nsBaseAppShell.cpp:138:17
#29 0x7f4946d3d114 in nsBaseAppShell::OnProcessNextEvent(nsIThreadInternal*, bool) /home/worker/workspace/build/src/widget/nsBaseAppShell.cpp:271
#30 0x7f4946d3dbaf in non-virtual thunk to nsBaseAppShell::OnProcessNextEvent(nsIThreadInternal*, bool) /home/worker/workspace/build/src/widget/nsBaseAppShell.cpp:233:17
#31 0x7f4940b2e7f4 in nsThread::ProcessNextEvent(bool, bool*) /home/worker/workspace/build/src/xpcom/threads/nsThread.cpp:1220:5

That seems consistent/unsurprising.
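(For reference, one simple way to force a stack like this from the harness is to abort the hung browser process once the Marionette connection times out. This is only a hypothetical sketch, assuming the ASan/crash-reporting runtime produces a report on SIGABRT, and not necessarily the mechanism used for the log above:

    import os
    import signal

    def dump_stack_on_connect_timeout(marionette, firefox_pid, timeout=120):
        try:
            marionette.start_session(timeout=timeout)
        except Exception:
            # Abort the hung process so its runtime can report where it is
            # stuck before the harness cleans everything up.
            os.kill(firefox_pid, signal.SIGABRT)
            raise
)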

 - Meanwhile, we have this happening:
CryptoTask::Run(): https://dxr.mozilla.org/mozilla-central/rev/e1135c6fdc9bcd80d38f7285b269e030716dcb72/security/manager/ssl/CryptoTask.cpp#55
...
openSignedAppFileAsync(): https://dxr.mozilla.org/mozilla-central/rev/e1135c6fdc9bcd80d38f7285b269e030716dcb72/security/apps/AppSignatureVerification.cpp#1044
verifyZipSignedState(): https://dxr.mozilla.org/mozilla-central/rev/e1135c6fdc9bcd80d38f7285b269e030716dcb72/toolkit/mozapps/extensions/internal/XPIProvider.jsm#1829
loadManifestFromZipReader(): https://dxr.mozilla.org/mozilla-central/rev/e1135c6fdc9bcd80d38f7285b269e030716dcb72/toolkit/mozapps/extensions/internal/XPIProvider.jsm#1527
loadManifestFromFile(): https://dxr.mozilla.org/mozilla-central/rev/e1135c6fdc9bcd80d38f7285b269e030716dcb72/toolkit/mozapps/extensions/internal/XPIProvider.jsm#1573

In all failures, CryptoTask::Run() is called off main thread and completes the call to NS_DispatchToMainThread(), then is never called again (is not called on the main thread).


My interpretation is that CryptoTask::Run() is waiting for the main thread, but the main thread is blocked on thread.processNextEvent(true)...but I'm out of my depth and unsure of the details.

If I change to thread.processNextEvent(false), I can no longer reproduce the hang, but that doesn't seem like a sensible solution.


:mossop -- It looks like you might know this code. Can you help me understand the issues, or suggest a solution?
Flags: needinfo?(gbrown) → needinfo?(dtownsend)
I don't know the crypto code or why it might be blocked on the main thread; you probably want Dan for that.
Flags: needinfo?(dtownsend) → needinfo?(dveditz)
No activity here in a week; checking in, as this is still a very high-frequency failure.
This is actually happening nearly all the time for Linux64, and I wonder if it is related to bug 1293093.

The reason is:

[task 2017-03-05T23:54:21.632367Z] 23:54:21     INFO - runtests.py | Application pid: 1642
[task 2017-03-05T23:54:21.633212Z] 23:54:21     INFO - TEST-INFO | started process Main app process
[task 2017-03-05T23:56:21.722802Z] 23:56:21     INFO -  Traceback (most recent call last):

We try to start Firefox but fail after 2 minutes.
Depends on: 1345105
Comment 41 documents the issue: A deadlock involving crypto code waiting for the main thread during startup.

Bug 1345105 may improve (or worsen) the behavior by changing startup timing, but it will not address the underlying issue.

I'm waiting on the needinfo to dveditz.
I think David Keeler can also help us here.
Flags: needinfo?(dkeeler)
(In reply to Geoff Brown [:gbrown] from comment #41)
...
> In all failures, CryptoTask::Run() is called off main thread and completes
> the call to NS_DispatchToMainThread(), then is never called again (is not
> called on the main thread).
> 
> 
> My interpretation is that CryptoTask::Run() is waiting for the main thread,
> but the main thread is blocked on thread.processNextEvent(true)...but I'm
> out of my depth and unsure of the details.

It's not waiting for the main thread - it's waiting to run on the main thread (which processNextEvent should do).

> If I change to thread.processNextEvent(false), I can no longer reproduce the
> hang, but that doesn't seem like a sensible solution.

This turns an idle wait into a busy wait.

It seems like what's happening here is that the CryptoTask is being added to the main thread's event queue but for some reason processNextEvent doesn't process the event unless called repeatedly. Maybe running with the nsThread log set to debug would be informative?
Flags: needinfo?(dkeeler)
I am not sure who owns fixing this bug. :gbrown, is this something you can do, or do you need someone like :keeler to fix it?
Flags: needinfo?(gbrown)
Sorry, couldn't look at this yesterday; it remains a high priority for me.

I will try to reproduce with nsThread logging, as suggested in comment 51.

I am happy to hand off this bug to anyone who is interested, but I think it remains unclear which component is to blame (we know where browser startup is hanging now, but don't fully understand why).
Flags: needinfo?(gbrown)
The platform breakdown in comment 54 is quite different from earlier. Curiously, this bug remains mostly Linux on trunk, but spiked suddenly on aurora March 6-9, primarily on Windows PGO.
The aurora/windows-pgo failures look quite different:

[Errno 10061] No connection could be made because the target machine actively refused it

Those started with https://treeherder.mozilla.org/#/jobs?repo=mozilla-aurora&revision=0931190cd725a744b74f8f47863b7b38be8e8ebf&filter-searchStr=windows+pgo. I understand tomcat is looking into it.


I'll remain focused on the Linux failures.
In https://treeherder.mozilla.org/#/jobs?repo=try&revision=ce67f735b6525cb5cb6a9a4f629a205a6ffc7008, I added nsThread logging, which I see in mozLogs.zip files for all the successful test runs. Unfortunately, for the failed cases, like https://treeherder.mozilla.org/logviewer.html#?job_id=83862794&repo=try, mozLogs.zip does not contain any traces for the failed pid.
Well, for successful jobs we clearly see that events are getting processed on the main thread:

[task 2017-03-15T00:51:03.032162Z] 00:51:03     INFO - runtests.py | Application pid: 1130
[task 2017-03-15T00:51:03.032244Z] 00:51:03     INFO - TEST-INFO | started process GECKO(1130)
[task 2017-03-15T00:51:04.063991Z] 00:51:04     INFO - GECKO(1130) | ZZZ XREMain::XRE_mainRun calls dirprovider.DoStartup
[task 2017-03-15T00:51:04.105691Z] 00:51:04     INFO - GECKO(1130) | nsXREDirProvider::DoStartup calls addons Observe for addons-startup on main thread
[task 2017-03-15T00:51:04.731357Z] 00:51:04     INFO - GECKO(1130) | ZZZ OpenSignedAppFileAsync
[task 2017-03-15T00:51:04.732343Z] 00:51:04     INFO - GECKO(1130) | ZZZ OpenSignedAppFileTask::CalculateResult called
[task 2017-03-15T00:51:04.733710Z] 00:51:04     INFO - GECKO(1130) | ProcessNextNativeEvent on main, mayWait=0
[task 2017-03-15T00:51:04.735141Z] 00:51:04     INFO - GECKO(1130) | ProcessNextNativeEvent on main, mayWait=0
[task 2017-03-15T00:51:04.749637Z] 00:51:04     INFO - GECKO(1130) | ZZZ OpenSignedAppFile OK
[task 2017-03-15T00:51:04.749777Z] 00:51:04     INFO - GECKO(1130) | ZZZ CryptoTask::Run 0x60b000041ee0 - NS_DispatchToMainThread okay
[task 2017-03-15T00:51:04.765823Z] 00:51:04     INFO - GECKO(1130) | ProcessNextNativeEvent on main, mayWait=0
[task 2017-03-15T00:51:04.766754Z] 00:51:04     INFO - GECKO(1130) | ProcessNextNativeEvent on main, mayWait=0

Whereby for failing jobs the following is happening:

[task 2017-03-15T01:33:50.248452Z] 01:33:50     INFO - runtests.py | Application pid: 1804
[task 2017-03-15T01:33:50.248541Z] 01:33:50     INFO - TEST-INFO | started process GECKO(1804)
[task 2017-03-15T01:33:51.090452Z] 01:33:51     INFO - GECKO(1804) | ZZZ XREMain::XRE_mainRun calls dirprovider.DoStartup
[task 2017-03-15T01:33:51.126704Z] 01:33:51     INFO - GECKO(1804) | nsXREDirProvider::DoStartup calls addons Observe for addons-startup on main thread
[task 2017-03-15T01:33:51.756964Z] 01:33:51     INFO - GECKO(1804) | ZZZ OpenSignedAppFileAsync
[task 2017-03-15T01:33:51.757053Z] 01:33:51     INFO - GECKO(1804) | ZZZ OpenSignedAppFileTask::CalculateResult called
[task 2017-03-15T01:33:51.757988Z] 01:33:51     INFO - GECKO(1804) | ProcessNextNativeEvent on main, mayWait=0
[task 2017-03-15T01:33:51.758663Z] 01:33:51     INFO - GECKO(1804) | ProcessNextNativeEvent on main, mayWait=0
[task 2017-03-15T01:33:51.782045Z] 01:33:51     INFO - GECKO(1804) | ZZZ OpenSignedAppFile OK
[task 2017-03-15T01:33:51.785720Z] 01:33:51     INFO - GECKO(1804) | ZZZ CryptoTask::Run 0x60b000041e30 - NS_DispatchToMainThread okay
[task 2017-03-15T01:35:50.279997Z] 01:35:50     INFO -  Traceback (most recent call last):

The crypto task got dispatched to the main thread, right? But then we have 2 minutes without any processing of events. At least this confirms the assumptions so far.

Would it help to enable all kinds of logging, so we can see what is blocking the main thread from processing events?
Flags: needinfo?(dkeeler)
Thanks for having a look, whimboo. 

I tried changing MOZ_LOG to "all" - I got no mozLogs.zip output at all! https://treeherder.mozilla.org/#/jobs?repo=try&revision=9de48e76eef69082f8d67c2f52ad0372ee6cb11f
You forgot to specify the level; "all:5" should do it.
Looks like some of those tests are failing with "Log file size 83275573 is greater than max allowed 52428800! Setting TBPL_FAILURE (was SUCCESS)..." - could that be preventing the log file from being uploaded? You could try enabling just some logs (although it looks like you did that in comment 60 and it didn't help...). In PSM, the relevant ones would be pipnss:4 and maybe certverifier:5.
Flags: needinfo?(dkeeler)
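As a concrete illustration, targeted logging could be enabled through the environment the harness passes to Firefox, using the modules mentioned above; the output path is hypothetical, and keeping the module list small helps stay under the log-size limit:

    import os

    env = os.environ.copy()
    # pipnss/certverifier for PSM, nsThread for the event-processing question.
    env["MOZ_LOG"] = "pipnss:4,certverifier:5,nsThread:5"
    env["MOZ_LOG_FILE"] = "/tmp/gecko.log"  # hypothetical output path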
Rather than fiddle with MOZ_LOG further, I just changed the LOG statements in nsThread.cpp into printf's:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=42299ff365d4d3827a19ac8713e0a95a8e073435

That causes even more "Log file size is greater than max allowed" failures, and also seems to make it more difficult to reproduce this hang, but I was still able to produce a hang and a corresponding successful run:


Hang:

https://public-artifacts.taskcluster.net/TCIijhPiSY-yvA3Dv8M8gg/0/public/logs/live_backing.log

[task 2017-03-15T19:54:10.847829Z] 19:54:10     INFO - GECKO(2276) | ZZZ OpenSignedAppFileTask::CalculateResult called
[task 2017-03-15T19:54:10.848564Z] 19:54:10     INFO - GECKO(2276) | THRD(0x612000009dc0) Dispatch [(nil) 0]
[task 2017-03-15T19:54:10.849391Z] 19:54:10     INFO - GECKO(2276) | ProcessNextNativeEvent on main, mayWait=0
[task 2017-03-15T19:54:10.850233Z] 19:54:10     INFO - GECKO(2276) | ProcessNextNativeEvent on main, mayWait=0
[task 2017-03-15T19:54:10.871748Z] 19:54:10     INFO - GECKO(2276) | ZZZ OpenSignedAppFile OK
[task 2017-03-15T19:54:10.874809Z] 19:54:10     INFO - GECKO(2276) | THRD(0x612000009dc0) Dispatch [(nil) 0]
[task 2017-03-15T19:54:10.875728Z] 19:54:10     INFO - GECKO(2276) | ZZZ CryptoTask::Run 0x60b0000420f0 - NS_DispatchToMainThread okay
[task 2017-03-15T19:54:10.876495Z] 19:54:10     INFO - GECKO(2276) | THRD(0x6120000e6bc0) ProcessNextEvent [0 0]
[task 2017-03-15T19:54:10.877298Z] 19:54:10     INFO - GECKO(2276) | THRD(0x6120000e6bc0) ProcessNextEvent [1 0]
[task 2017-03-15T19:55:00.167762Z] 19:55:00     INFO - GECKO(2276) | THRD(0x6120000d7d40) Dispatch [(nil) 0]
[task 2017-03-15T19:55:00.168578Z] 19:55:00     INFO - GECKO(2276) | THRD(0x6120000d7d40) ProcessNextEvent [0 0]
[task 2017-03-15T19:55:00.169364Z] 19:55:00     INFO - GECKO(2276) | THRD(0x6120000d7d40) running [0x621000643d20]
[task 2017-03-15T19:55:00.170168Z] 19:55:00     INFO - GECKO(2276) | THRD(0x6120000d7d40) ProcessNextEvent [0 0]
[task 2017-03-15T19:55:00.170924Z] 19:55:00     INFO - GECKO(2276) | THRD(0x6120000d7d40) Dispatch [(nil) 0]
[task 2017-03-15T19:55:00.171643Z] 19:55:00     INFO - GECKO(2276) | THRD(0x6120000d7d40) ProcessNextEvent [0 0]
[task 2017-03-15T19:55:00.172459Z] 19:55:00     INFO - GECKO(2276) | THRD(0x6120000d7d40) running [0x607000065580]
[task 2017-03-15T19:55:00.173193Z] 19:55:00     INFO - GECKO(2276) | THRD(0x6120000d7d40) ProcessNextEvent [0 0]
[task 2017-03-15T19:55:00.173760Z] 19:55:00     INFO - GECKO(2276) | THRD(0x612000009dc0) Dispatch [(nil) 0]
[task 2017-03-15T19:55:10.173609Z] 19:55:10     INFO - GECKO(2276) | THRD(0x6120000d7d40) Dispatch [(nil) 0]
[task 2017-03-15T19:55:10.174726Z] 19:55:10     INFO - GECKO(2276) | THRD(0x6120000d7d40) ProcessNextEvent [0 0]
[task 2017-03-15T19:55:10.174793Z] 19:55:10     INFO - GECKO(2276) | THRD(0x6120000d7d40) running [0x621000643d20]
[task 2017-03-15T19:55:10.174852Z] 19:55:10     INFO - GECKO(2276) | THRD(0x6120000d7d40) ProcessNextEvent [0 0]
[task 2017-03-15T19:56:09.314943Z] 19:56:09     INFO -  Traceback (most recent call last):
[task 2017-03-15T19:56:09.315267Z] 19:56:09     INFO -    File "/home/worker/workspace/build/tests/mochitest/runtests.py", line 2491, in doTests
[task 2017-03-15T19:56:09.315942Z] 19:56:09     INFO -      marionette_args=marionette_args,
[task 2017-03-15T19:56:09.317786Z] 19:56:09     INFO -    File "/home/worker/workspace/build/tests/mochitest/runtests.py", line 2095, in runApp
[task 2017-03-15T19:56:09.317854Z] 19:56:09     INFO -      self.marionette.start_session(timeout=port_timeout)
[task 2017-03-15T19:56:09.318038Z] 19:56:09     INFO -    File "/home/worker/workspace/build/venv/local/lib/python2.7/site-packages/marionette_driver/decorators.py", line 28, in _
[task 2017-03-15T19:56:09.318753Z] 19:56:09     INFO -      m._handle_socket_failure()
[task 2017-03-15T19:56:09.319745Z] 19:56:09     INFO -    File "/home/worker/workspace/build/venv/local/lib/python2.7/site-packages/marionette_driver/decorators.py", line 23, in _
[task 2017-03-15T19:56:09.320979Z] 19:56:09     INFO -      return func(*args, **kwargs)
[task 2017-03-15T19:56:09.322240Z] 19:56:09     INFO -    File "/home/worker/workspace/build/venv/local/lib/python2.7/site-packages/marionette_driver/marionette.py", line 1292, in start_session
[task 2017-03-15T19:56:09.323561Z] 19:56:09     INFO -      self.protocol, _ = self.client.connect()
[task 2017-03-15T19:56:09.325582Z] 19:56:09     INFO -    File "/home/worker/workspace/build/venv/local/lib/python2.7/site-packages/marionette_driver/transport.py", line 223, in connect
[task 2017-03-15T19:56:09.326296Z] 19:56:09     INFO -      self.sock.connect((self.addr, self.port))
[task 2017-03-15T19:56:09.326902Z] 19:56:09     INFO -    File "/usr/lib/python2.7/socket.py", line 228, in meth
[task 2017-03-15T19:56:09.327380Z] 19:56:09     INFO -      return getattr(self._sock,name)(*args)
[task 2017-03-15T19:56:09.327855Z] 19:56:09     INFO -  error: [Errno 111] Connection refused


Success:

[task 2017-03-15T19:59:39.346412Z] 19:59:39     INFO - GECKO(2207) | ZZZ OpenSignedAppFileTask::CalculateResult called
[task 2017-03-15T19:59:39.352121Z] 19:59:39     INFO - GECKO(2207) | ZZZ OpenSignedAppFile OK
[task 2017-03-15T19:59:39.353319Z] 19:59:39     INFO - GECKO(2207) | THRD(0x612000009dc0) Dispatch [(nil) 0]
[task 2017-03-15T19:59:39.354864Z] 19:59:39     INFO - GECKO(2207) | ZZZ CryptoTask::Run 0x60b000042040 - NS_DispatchToMainThread okay
[task 2017-03-15T19:59:39.355903Z] 19:59:39     INFO - GECKO(2207) | THRD(0x6120000e6bc0) ProcessNextEvent [0 0]
[task 2017-03-15T19:59:39.356826Z] 19:59:39     INFO - GECKO(2207) | THRD(0x6120000e6bc0) ProcessNextEvent [1 0]
[task 2017-03-15T19:59:39.379060Z] 19:59:39     INFO - GECKO(2207) | THRD(0x612000009dc0) running [0x607000000f00]
[task 2017-03-15T19:59:39.381582Z] 19:59:39     INFO - GECKO(2207) | THRD(0x612000009dc0) ProcessNextEvent [1 0]
[task 2017-03-15T19:59:39.382870Z] 19:59:39     INFO - GECKO(2207) | ProcessNextNativeEvent on main, mayWait=0
[task 2017-03-15T19:59:39.383540Z] 19:59:39     INFO - GECKO(2207) | ProcessNextNativeEvent on main, mayWait=0
[task 2017-03-15T19:59:39.384713Z] 19:59:39     INFO - GECKO(2207) | ProcessNextNativeEvent on main, mayWait=0
[task 2017-03-15T19:59:39.385528Z] 19:59:39     INFO - GECKO(2207) | THRD(0x612000009dc0) running [0x604000017390]
[task 2017-03-15T19:59:39.386594Z] 19:59:39     INFO - GECKO(2207) | THRD(0x612000009dc0) ProcessNextEvent [1 0]
[task 2017-03-15T19:59:39.387948Z] 19:59:39     INFO - GECKO(2207) | ProcessNextNativeEvent on main, mayWait=0
[task 2017-03-15T19:59:39.389559Z] 19:59:39     INFO - GECKO(2207) | THRD(0x612000009dc0) running [0x604000040f10]
[task 2017-03-15T19:59:39.390943Z] 19:59:39     INFO - GECKO(2207) | THRD(0x612000009dc0) ProcessNextEvent [1 0]
[task 2017-03-15T19:59:39.391911Z] 19:59:39     INFO - GECKO(2207) | ProcessNextNativeEvent on main, mayWait=0
[task 2017-03-15T19:59:39.393182Z] 19:59:39     INFO - GECKO(2207) | THRD(0x612000009dc0) running [0x6120000c5d40]
[task 2017-03-15T19:59:39.395848Z] 19:59:39     INFO - GECKO(2207) | THRD(0x612000009dc0) Dispatch [(nil) 0]
[task 2017-03-15T19:59:39.396906Z] 19:59:39     INFO - GECKO(2207) | THRD(0x612000009dc0) ProcessNextEvent [1 0]
[task 2017-03-15T19:59:39.398041Z] 19:59:39     INFO - GECKO(2207) | ProcessNextNativeEvent on main, mayWait=0
[task 2017-03-15T19:59:39.398829Z] 19:59:39     INFO - GECKO(2207) | ProcessNextNativeEvent on main, mayWait=0
[task 2017-03-15T19:59:39.399920Z] 19:59:39     INFO - GECKO(2207) | THRD(0x612000009dc0) running [0x6040000bde90]
[task 2017-03-15T19:59:39.400867Z] 19:59:39     INFO - GECKO(2207) | THRD(0x612000009dc0) ProcessNextEvent [1 0]
[task 2017-03-15T19:59:39.402731Z] 19:59:39     INFO - GECKO(2207) | ProcessNextNativeEvent on main, mayWait=0
[task 2017-03-15T19:59:39.403491Z] 19:59:39     INFO - GECKO(2207) | THRD(0x612000009dc0) running [0x607000065740]
[task 2017-03-15T19:59:39.404278Z] 19:59:39     INFO - GECKO(2207) | THRD(0x612000009dc0) ProcessNextEvent [1 0]
[task 2017-03-15T19:59:39.405062Z] 19:59:39     INFO - GECKO(2207) | ProcessNextNativeEvent on main, mayWait=0
[task 2017-03-15T19:59:39.405678Z] 19:59:39     INFO - GECKO(2207) | THRD(0x612000009dc0) running [0x60700006d150]
[task 2017-03-15T19:59:39.406488Z] 19:59:39     INFO - GECKO(2207) | THRD(0x612000009dc0) ProcessNextEvent [1 0]
[task 2017-03-15T19:59:39.407274Z] 19:59:39     INFO - GECKO(2207) | ProcessNextNativeEvent on main, mayWait=0
[task 2017-03-15T19:59:39.408085Z] 19:59:39     INFO - GECKO(2207) | THRD(0x612000009dc0) running [0x6040000f3990]
[task 2017-03-15T19:59:39.408950Z] 19:59:39     INFO - GECKO(2207) | THRD(0x612000009dc0) ProcessNextEvent [1 0]
[task 2017-03-15T19:59:39.416002Z] 19:59:39     INFO - GECKO(2207) | ProcessNextNativeEvent on main, mayWait=0
[task 2017-03-15T19:59:39.418090Z] 19:59:39     INFO - GECKO(2207) | THRD(0x612000009dc0) running [0x60b000042040]
[task 2017-03-15T19:59:39.418860Z] 19:59:39     INFO - GECKO(2207) | ZZZ CryptoTask::Run on main thread
[task 2017-03-15T19:59:39.421100Z] 19:59:39     INFO - GECKO(2207) | ZZZ OpenSignedAppFileTask::CallCallback called
[task 2017-03-15T19:59:39.422265Z] 19:59:39     INFO - GECKO(2207) | THRD(0x6120000e6bc0) sync shutdown
[task 2017-03-15T19:59:39.422841Z] 19:59:39     INFO - GECKO(2207) | THRD(0x6120000e6bc0) running [0x6060002753c0]
[task 2017-03-15T19:59:39.423602Z] 19:59:39     INFO - GECKO(2207) | THRD(0x6120000e6bc0) ProcessNextEvent [0 0]
[task 2017-03-15T19:59:39.425564Z] 19:59:39     INFO - GECKO(2207) | THRD(0x612000009dc0) Dispatch [(nil) 0]
[task 2017-03-15T19:59:39.426182Z] 19:59:39     INFO - GECKO(2207) | THRD(0x612000009dc0) ProcessNextEvent [1 1]
[task 2017-03-15T19:59:39.428992Z] 19:59:39     INFO - GECKO(2207) | THRD(0x612000009dc0) Dispatch [(nil) 0]
https://treeherder.mozilla.org/#/jobs?repo=try&revision=f048161f5db572f518b449fad9d2399b5bcf8eb0 has thread names plus some additional logging. 

In the failed cases, "Cache2 I/O" is the last thread to call ProcessNextEvent before the hang, but that ProcessNextEvent call seems to complete - not clear that there's any problem there. In the failed cases, the main thread is last seen calling ProcessNextEvent just before OpenSignedAppFileAsync is called; I still don't see what happens to that thread.
Well, the interesting piece here is the timing. Have a look at the following excerpt:

[task 2017-03-16T04:05:52.028141Z] 04:05:52     INFO - GECKO(1128) | [SignedJAR] THRD(0x6120000e6bc0) ProcessNextEvent [0 0]
[task 2017-03-16T04:05:52.028184Z] 04:05:52     INFO - GECKO(1128) | ==> return
[task 2017-03-16T04:05:52.028236Z] 04:05:52     INFO - GECKO(1128) | [SignedJAR] THRD(0x6120000e6bc0) ProcessNextEvent [1 0]
[task 2017-03-16T04:06:41.348327Z] 04:06:41     INFO - GECKO(1128) | THRD(0x6120000d7d40) Dispatch [(nil) 0]
[task 2017-03-16T04:06:41.348608Z] 04:06:41     INFO - GECKO(1128) | [Cache2 I/O] THRD(0x6120000d7d40) ProcessNextEvent [0 0]
[task 2017-03-16T04:06:41.349038Z] 04:06:41     INFO - GECKO(1128) | THRD(0x6120000d7d40) running [0x621000643d20]
[task 2017-03-16T04:06:41.350368Z] 04:06:41     INFO - GECKO(1128) | ==> return

There is a hang of about 50s between the return of the SignedJAR thread and the dispatching and running of the Cache2 I/O thread.

Would strace maybe give us further information?
That is an interesting pause, present in the failures and not present in the successful runs.

strace might produce something interesting...do you know how to set that up, whimboo?
(In reply to Geoff Brown [:gbrown] from comment #70)
> strace might produce something interesting...do you know how to set that up,
> whimboo?

Sadly not. But Dustin gave a quick hint in bug 1331198 comment 16, so something like "strace -p <firefox pid>" should be enough. I haven't had to use it to fix a problem myself, though.
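A minimal sketch of attaching strace to the already-running Firefox process from the harness side, assuming strace is installed on the worker (the helper and output path are hypothetical):

    import subprocess

    def attach_strace(firefox_pid, output_path="/tmp/firefox-startup.strace"):
        # -f follows threads/forks, -tt timestamps each call, -o writes to a file.
        return subprocess.Popen(
            ["strace", "-f", "-tt", "-p", str(firefox_pid), "-o", output_path])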
My latest try push, with lots of logging: https://treeherder.mozilla.org/#/jobs?repo=try&revision=331bfa1250d8ef90a5a6a684f404d5b4d9d40b1c

Note that "-b o -p linux64-asan -u mochitest-gpu -t none --rebuild 20" seems to reliably reproduce the hang.
(In reply to Henrik Skupin (:whimboo) from comment #69)
> Well, the interesting piece here is the timing. Have a look at the following
> excerpt:
> 
> [task 2017-03-16T04:05:52.028141Z] 04:05:52     INFO - GECKO(1128) |
> [SignedJAR] THRD(0x6120000e6bc0) ProcessNextEvent [0 0]
> [task 2017-03-16T04:05:52.028184Z] 04:05:52     INFO - GECKO(1128) | ==>
> return
> [task 2017-03-16T04:05:52.028236Z] 04:05:52     INFO - GECKO(1128) |
> [SignedJAR] THRD(0x6120000e6bc0) ProcessNextEvent [1 0]
> [task 2017-03-16T04:06:41.348327Z] 04:06:41     INFO - GECKO(1128) |
> THRD(0x6120000d7d40) Dispatch [(nil) 0]
> [task 2017-03-16T04:06:41.348608Z] 04:06:41     INFO - GECKO(1128) | [Cache2
> I/O] THRD(0x6120000d7d40) ProcessNextEvent [0 0]
> [task 2017-03-16T04:06:41.349038Z] 04:06:41     INFO - GECKO(1128) |
> THRD(0x6120000d7d40) running [0x621000643d20]
> [task 2017-03-16T04:06:41.350368Z] 04:06:41     INFO - GECKO(1128) | ==>
> return
> 
> There is a hang of about 50s between the return of the SignedJAR thread and
> the dispatching and running of the Cache2 I/O thread.

At first I thought that was interesting too, but actually, I think there's no reason to expect the Cache2 thread to be active immediately after the SignedJAR thread event completes. In successful runs there are also long pauses between bursts of Cache2 activity. (That may just not be apparent because the main thread, and other threads, are continuously active.)
(In reply to Geoff Brown [:gbrown] from comment #76)
> My latest try push, with lots of logging:
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=331bfa1250d8ef90a5a6a684f404d5b4d9d40b1c

Looking at one of the failures:

https://public-artifacts.taskcluster.net/G4-HSvXAQV6KEbrO6N_yGw/0/public/logs/live_backing.log

[task 2017-03-22T18:05:22.879544Z] 18:05:22     INFO - GECKO(1468) | nsXREDirProvider::DoStartup calls addons Observe for addons-startup on main thread
[task 2017-03-22T18:05:23.138075Z] 18:05:23     INFO - GECKO(1468) | THRD(0x61400000ce40) Dispatch [(nil) 0]
[task 2017-03-22T18:05:23.478692Z] 18:05:23     INFO - GECKO(1468) | THRD(0x61400000ce40) Dispatch [(nil) 0]
[task 2017-03-22T18:05:23.480116Z] 18:05:23     INFO - GECKO(1468) | 1490205923472	addons.xpi	WARN	syncLoadManifestFromFile waiting for completion
[task 2017-03-22T18:05:23.480809Z] 18:05:23     INFO - GECKO(1468) | [Main Thread] THRD(0x61400000ce40) ProcessNextEvent [1 0]
[task 2017-03-22T18:05:23.482189Z] 18:05:23     INFO - GECKO(1468) | ZZZ OpenSignedAppFileAsync on main thread
[task 2017-03-22T18:05:23.483832Z] 18:05:23     INFO - GECKO(1468) | THRD(0x6140000a8640) Dispatch [(nil) 0]
[task 2017-03-22T18:05:23.485374Z] 18:05:23     INFO - GECKO(1468) | (0x60b000041a10) CryptoTask::Dispatch OK
[task 2017-03-22T18:05:23.485962Z] 18:05:23     INFO - GECKO(1468) | ZZZ OpenSignedAppFileAsync - SignedJAR Dispatch OK
[task 2017-03-22T18:05:23.486627Z] 18:05:23     INFO - GECKO(1468) | THRD(0x61400000ce40) Dispatch [(nil) 0]
[task 2017-03-22T18:05:23.486726Z] 18:05:23     INFO - GECKO(1468) | [Main Thread] THRD(0x61400000ce40) ProcessNextEvent calls OnProcessNextEvent [1]
[task 2017-03-22T18:05:23.487338Z] 18:05:23     INFO - GECKO(1468) | [Main Thread] ProcessNextNativeEvent, mayWait=0
[task 2017-03-22T18:05:23.490190Z] 18:05:23     INFO - GECKO(1468) | [SignedJAR] THRD(0x6140000a8640) ProcessNextEvent [0 0]
[task 2017-03-22T18:05:23.490953Z] 18:05:23     INFO - GECKO(1468) | THRD(0x6140000a8640) running [0x60b000041a10]
[task 2017-03-22T18:05:23.491779Z] 18:05:23     INFO - GECKO(1468) | [Main Thread] ProcessNextNativeEvent returns 1
[task 2017-03-22T18:05:23.492888Z] 18:05:23     INFO - GECKO(1468) | [Main Thread] ProcessNextNativeEvent, mayWait=0
[task 2017-03-22T18:05:23.518732Z] 18:05:23     INFO - GECKO(1468) | OpenSignedAppFile returns NS_OK
[task 2017-03-22T18:05:23.521801Z] 18:05:23     INFO - GECKO(1468) | (0x60b000041a10) OpenSignedAppFileTask::CalculateResult returns
[task 2017-03-22T18:05:23.522734Z] 18:05:23     INFO - GECKO(1468) | THRD(0x61400000ce40) Dispatch [(nil) 0]
[task 2017-03-22T18:05:23.523675Z] 18:05:23     INFO - GECKO(1468) | (0x60b000041a10) CryptoTask::Run - NS_DispatchToMainThread okay
[task 2017-03-22T18:05:23.524548Z] 18:05:23     INFO - GECKO(1468) | ==> THRD(0x6140000a8640) ProcessNextEvent return
[task 2017-03-22T18:05:23.525419Z] 18:05:23     INFO - GECKO(1468) | [SignedJAR] THRD(0x6140000a8640) ProcessNextEvent [0 0]
[task 2017-03-22T18:05:23.526485Z] 18:05:23     INFO - GECKO(1468) | ==> THRD(0x6140000a8640) ProcessNextEvent return
[task 2017-03-22T18:05:23.527467Z] 18:05:23     INFO - GECKO(1468) | [SignedJAR] THRD(0x6140000a8640) ProcessNextEvent [1 0]
[task 2017-03-22T18:06:12.826174Z] 18:06:12     INFO - GECKO(1468) | THRD(0x61400008fe40) Dispatch [(nil) 0]
[task 2017-03-22T18:06:12.827068Z] 18:06:12     INFO - GECKO(1468) | [Cache2 I/O] THRD(0x61400008fe40) ProcessNextEvent [0 0]
[task 2017-03-22T18:06:12.828077Z] 18:06:12     INFO - GECKO(1468) | THRD(0x61400008fe40) running [0x621000643d20]
[task 2017-03-22T18:06:12.828912Z] 18:06:12     INFO - GECKO(1468) | ==> THRD(0x61400008fe40) ProcessNextEvent return
[task 2017-03-22T18:06:12.829729Z] 18:06:12     INFO - GECKO(1468) | [Cache2 I/O] THRD(0x61400008fe40) ProcessNextEvent [0 0]
[task 2017-03-22T18:06:12.830412Z] 18:06:12     INFO - GECKO(1468) | ==> THRD(0x61400008fe40) ProcessNextEvent return
[task 2017-03-22T18:06:12.831404Z] 18:06:12     INFO - GECKO(1468) | THRD(0x61400008fe40) Dispatch [(nil) 0]
[task 2017-03-22T18:06:12.832129Z] 18:06:12     INFO - GECKO(1468) | [Cache2 I/O] THRD(0x61400008fe40) ProcessNextEvent [0 0]
[task 2017-03-22T18:06:12.832858Z] 18:06:12     INFO - GECKO(1468) | THRD(0x61400008fe40) running [0x607000065580]
[task 2017-03-22T18:06:12.833615Z] 18:06:12     INFO - GECKO(1468) | ==> THRD(0x61400008fe40) ProcessNextEvent return
[task 2017-03-22T18:06:12.834452Z] 18:06:12     INFO - GECKO(1468) | [Cache2 I/O] THRD(0x61400008fe40) ProcessNextEvent [0 0]
[task 2017-03-22T18:06:12.835233Z] 18:06:12     INFO - GECKO(1468) | ==> THRD(0x61400008fe40) ProcessNextEvent return
[task 2017-03-22T18:06:12.835908Z] 18:06:12     INFO - GECKO(1468) | THRD(0x61400000ce40) Dispatch [(nil) 0]
[task 2017-03-22T18:06:22.845240Z] 18:06:22     INFO - GECKO(1468) | THRD(0x61400008fe40) Dispatch [(nil) 0]
[task 2017-03-22T18:06:22.845358Z] 18:06:22     INFO - GECKO(1468) | [Cache2 I/O] THRD(0x61400008fe40) ProcessNextEvent [0 0]
[task 2017-03-22T18:06:22.846286Z] 18:06:22     INFO - GECKO(1468) | THRD(0x61400008fe40) running [0x621000643d20]
[task 2017-03-22T18:06:22.846355Z] 18:06:22     INFO - GECKO(1468) | ==> THRD(0x61400008fe40) ProcessNextEvent return
[task 2017-03-22T18:06:22.846815Z] 18:06:22     INFO - GECKO(1468) | [Cache2 I/O] THRD(0x61400008fe40) ProcessNextEvent [0 0]
[task 2017-03-22T18:06:22.847242Z] 18:06:22     INFO - GECKO(1468) | ==> THRD(0x61400008fe40) ProcessNextEvent return
[task 2017-03-22T18:07:22.068483Z] 18:07:22     INFO -  Traceback (most recent call last):

Note particularly this part:

[task 2017-03-22T18:05:23.491779Z] 18:05:23     INFO - GECKO(1468) | [Main Thread] ProcessNextNativeEvent returns 1
[task 2017-03-22T18:05:23.492888Z] 18:05:23     INFO - GECKO(1468) | [Main Thread] ProcessNextNativeEvent, mayWait=0
[task 2017-03-22T18:05:23.518732Z] 18:05:23     INFO - GECKO(1468) | OpenSignedAppFile returns NS_OK
[task 2017-03-22T18:05:23.521801Z] 18:05:23     INFO - GECKO(1468) | (0x60b000041a10) OpenSignedAppFileTask::CalculateResult returns
[task 2017-03-22T18:05:23.522734Z] 18:05:23     INFO - GECKO(1468) | THRD(0x61400000ce40) Dispatch [(nil) 0]
[task 2017-03-22T18:05:23.523675Z] 18:05:23     INFO - GECKO(1468) | (0x60b000041a10) CryptoTask::Run - NS_DispatchToMainThread okay

https://hg.mozilla.org/try/rev/51e6963f68a269fc69c76f9844f588a94fd58af7#l7.18

nsAppShell::ProcessNextNativeEvent(), running on the main thread, called g_main_context_iteration() but that never returned. OpenSignedAppFile() and OpenSignedAppFileTask::CalculateResult() completed immediately after, but that's not on the main thread. What happened to the main thread?


:keeler -- Can you have another look? I don't know what else to do here.
Assignee: gbrown → nobody
Flags: needinfo?(dkeeler)
Do we have an idea of which tests are causing this? Maybe we can narrow it down that way.
I don't think there is any connection to individual tests -- this is a hang on browser startup, well before any tests are run. It might be affected by add-ons or anything else configured in the test profile.

It seems to affect all mochitests, but hits mochitest-gl and mochitest-gpu more often than plain - I don't know why.

I wouldn't be surprised if there is a similar issue affecting reftests that is reported differently, but I haven't noticed anything specific.
To test out the assumption that something in the threading of OpenSignedAppFile was causing the timeouts, I wrote a patch that (I believe) made it synchronous rather than async. With 30 retriggers, the intermittent didn't show up:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=e24de258c0bca34c852af74bdf61ee1b668e85ea

This then reminded me of bug 1185489, which involves code in this area that isn't thread-safe. I wrote a different patch that just removed the code in question. Again, after 30 retriggers, the intermittent didn't show up:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=fb748ae85e71956819f37502008d508813682cd2

To convince myself that these results were significant, I attempted to reproduce the intermittent on the base revision that both of those patches are based on:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=32bad569ddff5bc4c1be3dd7ed0122090af401ec&filter-searchStr=gpu

.. and unfortunately I haven't been able to make that fail, despite that being the same code as this:

https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&revision=4c987b7ed54a&filter-searchStr=linux%20x64%20asan%20gpu

which does have a failure.

I'll keep looking, but unfortunately I really don't have anything to add at the moment.
Flags: needinfo?(dkeeler)
Thanks :keeler. Since my logging seems to make this happen with greater regularity, here are try pushes with my logging on its own + with each of your patches: https://treeherder.mozilla.org/#/jobs?repo=try&author=gbrown@mozilla.com&tochange=86612c999285064cdc09abcb3c7398e1069089aa&fromchange=032ca845c994d6d9118c1db376daa0072d72fac6
...your "convert openSignedAppFileAsync to sync" patch seems to fix the hang!
:keeler - Not trying to rush you, but thought I'd check in to see if you noticed the previous couple of comments. Is converting to sync a viable solution here? Let me know if I can run more try experiments or help with anything else.
Flags: needinfo?(dkeeler)
I did see those comments, but thanks for the needinfo anyway. Converting that function to a synchronous call isn't really viable - the file reading/public key operations are long-running and not something we want blocking the main thread. What's odd is that this use of CryptoTask fails so frequently while others don't seem to. I'll keep looking, though.
Flags: needinfo?(dkeeler)
Flags: needinfo?(dveditz)
:keeler, this is still causing a lot of pain on our trees; can you look into a more appropriate fix for openSignedAppFileAsync (per comment 13)?
Flags: needinfo?(dkeeler)
I didn't manage to get to this today, but it's on my radar. Keeping the ni? so I remember to have a look.
It looks like you are doing everything right. I think something has shifted and my trick for reproducing is not working now.

I thought perhaps we would have better luck with a different test job, but that logging patch generates so much output that almost all the tests fail and create massive log files. I trimmed it down with more hackery in https://treeherder.mozilla.org/#/jobs?repo=try&revision=46f757fb8a70752db14cad81718db384b610cc17, but I'm still not having any luck reproducing the failure.

The failure is still happening, according to Orange Factor, but reproducing it on try seems difficult. If you have a possible solution, I'd suggest landing it and seeing if the failures go away. Otherwise, we may just need to wait and see if we can reproduce it again in the future. (Or roll back and push a revision from a couple of weeks ago?)
Flags: needinfo?(gbrown)
Oddly, this has greatly reduced in frequency since April 11th.
Whiteboard: [stockwell needswork] → [stockwell unknown]
See Also: → 1352671
This is failing frequently again. I pushed the logging patch that :keeler and I were using earlier: https://treeherder.mozilla.org/#/jobs?repo=try&revision=ab246e341381addda1437dd67039e0506d5fadde. Good news: It is now easy to reproduce this on linux32-debug. Bad news: The hang seems to have moved.

I'll try to track it down again.
Flags: needinfo?(gbrown)
I am seeing a few different causes of this failure. Let's look at one...

https://treeherder.mozilla.org/#/jobs?repo=try&revision=c8d621a1f21ba3c5f6c63e3595c2c0c0d9b2e028

https://public-artifacts.taskcluster.net/CH-cN9NLRpqunwGTEKcUxQ/0/public/logs/live_backing.log

 20:39:51     INFO - TEST-INFO | started process GECKO(1082)
 20:39:51     INFO - GECKO(1082) | ### XPCOM_MEM_BLOAT_LOG defined -- logging bloat/leaks to /tmp/tmpxcE5GX.mozrunner/runtests_leaks.log
 20:41:44     INFO - GECKO(1082) | XRE_mainRun inits commandline...
...
 20:41:45     INFO - GECKO(1082) | 1494362505551	Marionette	INFO	marionette enabled by command line flag
...
 20:41:49     INFO - GECKO(1082) | ++DOMWINDOW == 12 (0xde3e3c00) [pid = 1082] [serial = 12] [outer = 0xde4b8c00]
 20:41:51     INFO -  Traceback (most recent call last):
...
 20:41:51     INFO -  error: [Errno 111] Connection refused
 20:41:51    ERROR - Automation Error: Received unexpected exception while running application

It seems the -marionette flag has been seen and Marionette enabled -- https://hg.mozilla.org/try/rev/af523ebe17152d836b87244f56a79d81aed0337a#l1.12 -- but not initialized -- https://hg.mozilla.org/try/rev/af523ebe17152d836b87244f56a79d81aed0337a#l1.28.

There was activity just before the connection error, at 20:41:49...I wonder if we just hadn't gotten around to initializing MarionetteComponent. It looks like the connection error happens 120 seconds after process launch; could that time period be lengthened, and might that avoid some of these failures, especially on slow platforms like linux32-debug?

:whimboo -- Thoughts?
Flags: needinfo?(gbrown) → needinfo?(hskupin)
Hm, we wait for the "sessionstore-windows-restored" observer topic to be fired before we do the initialization. Maybe this one is coming through too late, or maybe never?

https://dxr.mozilla.org/mozilla-central/rev/b21b974d60d3075ae24f6fb1bae75d0f122f28fc/testing/marionette/components/marionette.js#197

On Windows we also wait until the GFX sanity window has been closed.
Flags: needinfo?(hskupin)
Thanks - I was wondering how that worked.

Reviewing some of the successful jobs in https://treeherder.mozilla.org/#/jobs?repo=try&revision=c8d621a1f21ba3c5f6c63e3595c2c0c0d9b2e028, I see more evidence that it can take ~2 minutes to init marionette:

https://public-artifacts.taskcluster.net/FmPpTwH8SOS9xFVUKNGhMg/0/public/logs/live_backing.log

20:42:26     INFO - TEST-INFO | started process GECKO(1096)
20:44:18     INFO - GECKO(1096) | 1494362658486	Marionette	INFO	Marionette init
20:44:18     INFO - GECKO(1096) | 1494362658960	Marionette	INFO	Listening on port 2828

https://public-artifacts.taskcluster.net/SPeyE3TqRXq9EAUhcBxEaQ/0/public/logs/live_backing.log
20:40:44     INFO - TEST-INFO | started process GECKO(1088)
20:42:24     INFO - GECKO(1088) | 1494362544274	Marionette	INFO	Marionette init
20:42:24     INFO - GECKO(1088) | 1494362544695	Marionette	INFO	Listening on port 2828

It can take less time - between 60 and 90 seconds seems common - but it isn't hard to find cases close to 120 seconds.
Geoff, would you mind pushing a try run with the following changes:
https://hg.mozilla.org/try/rev/d53acf55de513e4eaa6915877d9556e9d896bd3d

I also use that for another bug, and the output is pretty helpful. It will show us which observer notifications Marionette actually receives, and it should give us an indication of where in the initialization code we are waiting.
I also tried increasing DEFAULT_STARTUP_TIMEOUT from 120 seconds to 180 seconds: https://treeherder.mozilla.org/#/jobs?repo=try&revision=300dc393115b1973cf7c28ee522c45dd1dec11d0. With this change, I could not reproduce this failure.
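For context, a minimal sketch of the client-side knob being adjusted here, mirroring the start_session(timeout=...) call visible in the tracebacks above (the host, port, and 180-second value are only illustrative):

    from marionette_driver.marionette import Marionette

    client = Marionette(host="127.0.0.1", port=2828)
    # Allow slow platforms such as linux32-debug more than the default
    # 120 seconds to bring up the Marionette server before the connect
    # attempt fails with "Connection refused".
    client.start_session(timeout=180)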
So here is the timing for a failing test:

[task 2017-05-10T15:08:46.572745Z] 15:08:46     INFO - TEST-INFO | started process GECKO(1085)
[..]
[task 2017-05-10T15:09:54.628136Z] 15:09:54     INFO - GECKO(1085) | 1494428994626	Marionette	INFO	Received observer notification "sessionstore-windows-restored"
[task 2017-05-10T15:09:55.037043Z] 15:09:55     INFO - GECKO(1085) | 1494428995035	Marionette	INFO	Listening on port 2828

That means it took 1:10 min to start Marionette for this debug build. This is insanely long!

The Linux workers are using Ubuntu 16.04.2 LTS, so bug 1345105 shouldn't have an effect here.

Geoff, I would propose that you request a one-click loaner, run the setup task, and then run the tests manually while inspecting the CPU load while Firefox is starting up. I could imagine there is some other process that takes a fair amount of CPU power and slows Firefox down.
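A small sketch of how the CPU load could be sampled on such a loaner while Firefox starts, using psutil (which the harness already uses for process management); the helper and its arguments are hypothetical:

    import psutil

    def sample_cpu_during_startup(firefox_pid, duration=120, interval=1.0):
        proc = psutil.Process(firefox_pid)
        for _ in range(int(duration / interval)):
            # Per-process and system-wide CPU percentages over each interval.
            print("firefox: %5.1f%%  system: %5.1f%%"
                  % (proc.cpu_percent(interval=interval), psutil.cpu_percent()))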
(In reply to Henrik Skupin (:whimboo) from comment #111)
> That means it took 1:10 min to start Marionette for this debug build. This
> is insanely long!

It is a long time from a user perspective, but this is linux32 and debug and some cheap aws machine -- I don't feel qualified to criticize! :)

It looks to me like startup in this environment is usually between 60 and 120 seconds, with something like 3% of jobs taking more than 120 seconds (causing failures here).

> Geoff, I would propose that you request a one-click loaner, run the setup
> task, and then run the tests manually while inspecting the CPU load while
> Firefox is starting up. I could imagine there is some other process that
> takes a fair amount of CPU power and slows Firefox down.

Good idea. I did that, and I noted startup taking 60+ seconds, as expected. top indicated firefox-bin taking 95% CPU and python taking 4% CPU during startup; I didn't see any other significant CPU use during startup.
Ok, in that case we have two options... 1) invest in faster workers, or 2) bump up the startup timeout. I will leave it up to you to decide. I only wonder whether we actually save money by using such slow workers, which take much, much longer to run a test job, compared to faster workers with shorter job times.
Depends on: 1364228
Assignee: nobody → gbrown
I am trying to address this in bug 1364228, and am hopeful that changes there will reduce failure frequency by 80%+.
Whiteboard: [stockwell unknown] → [stockwell needswork]
(In reply to Geoff Brown [:gbrown] from comment #118)
> I am trying to address this in bug 1364228, and am hopeful that changes
> there will reduce failure frequency by 80%+.

Nearly all (but not quite all) linux failures have been eliminated.
Assignee: gbrown → nobody
Whiteboard: [stockwell needswork] → [stockwell fixed]
this picked up on the 17th, primarily on osx opt in mochitest*;  :gbrown, can you maybe help narrow this down or do your magic to make this actionable?
Flags: needinfo?(gbrown)
Whiteboard: [stockwell fixed] → [stockwell needswork]
(In reply to Joel Maher ( :jmaher) from comment #128)
> this picked up on the 17th, primarily on osx opt in mochitest*;  :gbrown,
> can you maybe help narrow this down or do your magic to make this actionable?

Maybe a side-effect from moving from Buildbot to Taskcluster?
Looking at the osx-10-10 failures only, I see - 
 - opt only, non-e10s only
 - various mochitests...perhaps biased to mochitest-chrome, mochitest-bc?
 - fails after 2 minutes of trying to connect; successful runs generally connect in a few seconds
 - earliest that I see is https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=00fa4f730bf6b7779a8fb78aaef3e23322f5750a&filter-searchStr=osx, but that certainly didn't cause this.
I'm debugging on try...may take a while.

Also, note whimboo's comments in bug 1374762.
In the failed cases, I find profile-after-change is seen but sessionstore-windows-restored is not.
(In reply to Geoff Brown [:gbrown] from comment #134)
> In the failed cases, I find profile-after-change is seen but
> sessionstore-windows-restored is not.

Yeah, that would indicate that we might have started Firefox in safe mode. The application command doesn't show it:

17:37:54     INFO - Application command: /Users/cltbld/tasks/task_1498003975/build/application/Nightly.app/Contents/MacOS/firefox -marionette -foreground -profile /var/folders/p5/c6r39pb97p5___0gvb4wyhzr00000w/T/tmpPpotiL.mozrunner

But I wonder which code paths could actually trigger the safe mode to become active. Geoff, have you had a chance to see it on a one click loaner? If we can see it with VNC connected that would be great.
(In reply to Henrik Skupin (:whimboo) from comment #135)
> active. Geoff, have you had a chance to see it on a one click loaner? If we
> can see it with VNC connected that would be great.

I have not seen it on a one click loaner. I have only reproduced it with try pushes like https://treeherder.mozilla.org/#/jobs?repo=try&revision=8e781e0c5b0bf42723b7a8f2cd14062916f3b309.

The latest result indicates sessionstore-windows-restored is not received because browser.js's onLoad() is not called:

https://dxr.mozilla.org/mozilla-central/rev/92eb911c35da48907d326604c4c92cf55e551895/browser/base/content/browser.js#1255

and therefore doesn't notify sessionstore-windows-restored:

https://dxr.mozilla.org/mozilla-central/rev/92eb911c35da48907d326604c4c92cf55e551895/browser/base/content/browser.js#1313
(In reply to Geoff Brown [:gbrown] from comment #137)
> I have not seen it on a one click loaner. I have only reproduced it with try
> pushes like
> https://treeherder.mozilla.org/#/
> jobs?repo=try&revision=8e781e0c5b0bf42723b7a8f2cd14062916f3b309.

bc2 and bc3 are failing way more than other tests, so having a look at the VNC screen while running such a test multiple times should give us more information. It would be good to know whether the browser window comes up at all, or e.g. whether safe mode is active.
I don't think there is a VNC screen to access; these tests all run in AWS, and as gbrown mentioned, he hasn't seen this on a one-click loaner.
Every one-click loaner, at least for Linux workers, also allows you to connect to the screen via VNC. If he hasn't seen it yet, he might not have run enough test jobs to trigger the behavior. It's all the same environment, so I do not understand why a one-click loaner should not be helpful here.
:whimboo, a one-click loaner is not always a realistic way to reproduce problems; many times I have had to give up and go to the try server. Have you seen this failure while inside a one-click loaner? Maybe you have tips for how to reproduce it.
What is the try server? If you mean try builds, then those jobs also allow a one-click loaner. Again, tests on those run totally the same way. I really don't understand the problem here, and we may want to discuss this outside of this bug.
I just haven't used a one-click loaner for this because it seemed inconvenient: With a try push I can start 100+ mochitest jobs and almost always reproduce this problem a couple of times. I don't fancy sitting on a one-click loaner re-running mochitest 100 times. I'll give it a shot with bc2/bc3...see if I can catch it.
...or not!

It looks like one-click loaners are not available for osx (the one-click loaner option can be selected in Treeherder, but it is disabled in Taskcluster).

This happens too infrequently on linux to try to reproduce there.
Possibly getting a physical loaner?

This could be a side effect of running tests in Taskcluster vs. Buildbot; the timing looks about right.
There have been no new failures on any platform since June 24 and none on OSX since June 23. That's a weekend, with fewer pushes than normal, but a try push seems to confirm: This isn't happening any more.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=eb4128adf5fa4d197c868e2f33d5039c18d1eb99

My planned next step here was to try to track down why browser.js' onLoad() was not being called, even though xpcom startup seemed fine. I also wanted to check if this might be related to safe mode, but didn't have a concrete idea of how to pursue either idea.

Hoping this stays fixed...
Flags: needinfo?(gbrown)
My patch for bug 1374762 landed 2 days ago, and since then we haven't had a single failure. So I really wonder if this could have been due to an inappropriate start in safe mode.

Please note that with my patch I added a DEBUG log line to Marionette which prints out when the safe mode dialog gets handled. Sadly, by default this will not show up in the logs of other test suites, because the preference `marionette.log.level = 'trace'` is not used. Maybe it should be added, so that Marionette prints a bit more output in case of failures.

Geoff, what do you think?
Flags: needinfo?(gbrown)
There have been a couple of failures today. :(

...But there has been a lot of improvement here: most of it around June 24, and perhaps some more around the time of your safe mode change. That's great!


I find investigating and debugging these "Connection refused" failures tiresome and welcome additional logging that would help narrow down the cause of the failure more quickly. Looking at the current Marionette test logs, I'm not convinced the TRACE messages would help much: it looks like "Listening on port 2828" is still the first message, and I don't see much more logging at startup. I'm also concerned by all the TRACE messages while tests are running: our test logs are already so big and convoluted, I don't want to see much more added unless we really need it. Maybe it's best to keep it the way it is? It's easy to add TRACE logs in try pushes.
Flags: needinfo?(gbrown)
Geoff, it should actually be enough to use `marionette.log.level = 'debug'`, because that is what we use to print out other information during startup and page loads. The latter shouldn't matter here, because it would only happen for page load requests made via Marionette, and that's nothing those test harnesses are doing.

Otherwise we could add an INFO-level line for how Marionette has been enabled (command line flag or environment variable). It would indicate that Marionette is enabled, but not necessarily that the server has started.
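For illustration only, a minimal sketch (my addition, not code from this bug or from any harness) of how a harness that builds its test profile with mozprofile could set that preference. The pref name comes from this comment; the mozprofile usage and values are assumptions:

# Hedged sketch: raise Marionette's log level via the profile preference
# mentioned above, assuming the harness builds its profile with mozprofile.
from mozprofile import Profile

profile = Profile(preferences={"marionette.log.level": "debug"})
print(profile.profile)  # path of the generated profile directory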
(In reply to Henrik Skupin (:whimboo) from comment #152)
> startup and page loads. The latter shouldn't matter here, because it would
> only happen for page load requests made via Marionette, and that's nothing
> those test harnesses are doing.

Oh, I see - I hadn't thought of that. That sounds okay then.

> Otherwise we could add an INFO-level line for how Marionette has been
> enabled (command line flag or environment variable). It would indicate that
> Marionette is enabled, but not necessarily that the server has started.

That sounds good too!
Depends on: 1378209
(In reply to Geoff Brown [:gbrown] from comment #153)
> > startup and page loads. The latter shouldn't matter here, because it would
> > only happen for page load requests made via Marionette, and that's nothing
> > those test harnesses are doing.
> 
> Oh, I see - I hadn't thought of that. That sounds okay then.

I would leave this up to you then given that this is outside of Marionette.

> > Otherwise we could add an INFO-level line for how Marionette has been
> > enabled (command line flag or environment variable). It would indicate that
> > Marionette is enabled, but not necessarily that the server has started.
> 
> That sounds good too!

I filed bug 1378209 to get this implemented soon.
So a normal application start looks like this:

[task 2017-07-09T10:27:16.225590Z] 10:27:16     INFO - Application command: /home/worker/workspace/build/application/firefox/firefox -marionette -foreground -profile /tmp/tmpABZapw.mozrunner
[task 2017-07-09T10:27:16.246287Z] 10:27:16     INFO - runtests.py | Application pid: 1813
[task 2017-07-09T10:27:16.246599Z] 10:27:16     INFO - TEST-INFO | started process GECKO(1813)
[task 2017-07-09T10:27:18.870075Z] 10:27:18     INFO - GECKO(1813) | 1499596038867	Marionette	INFO	Enabled via --marionette
[task 2017-07-09T10:27:23.520937Z] 10:27:23     INFO - GECKO(1813) | 1499596043514	Marionette	INFO	Listening on port 2828

As you can see Marionette is enabled via the command line flag. But here is the output from a non-working case:

[task 2017-07-09T10:30:16.098939Z] 10:30:16     INFO - Application command: /home/worker/workspace/build/application/firefox/firefox -marionette -foreground -profile /tmp/tmpoguz_J.mozrunner
[task 2017-07-09T10:30:16.114279Z] 10:30:16     INFO - runtests.py | Application pid: 2431
[task 2017-07-09T10:30:16.114769Z] 10:30:16     INFO - TEST-INFO | started process GECKO(2431)
[task 2017-07-09T10:33:16.192469Z] 10:33:16     INFO -  Traceback (most recent call last):
[task 2017-07-09T10:33:16.195312Z] 10:33:16     INFO -    File "/home/worker/workspace/build/tests/mochitest/runtests.py", line 2534, in doTests
[task 2017-07-09T10:33:16.197115Z] 10:33:16     INFO -      marionette_args=marionette_args,
[task 2017-07-09T10:33:16.199086Z] 10:33:16     INFO -    File "/home/worker/workspace/build/tests/mochitest/runtests.py", line 2135, in runApp
[task 2017-07-09T10:33:16.200887Z] 10:33:16     INFO -      self.marionette.start_session(timeout=port_timeout)
[task 2017-07-09T10:33:16.202293Z] 10:33:16     INFO -    File "/home/worker/workspace/build/venv/local/lib/python2.7/site-packages/marionette_driver/decorators.py", line 28, in _
[task 2017-07-09T10:33:16.203538Z] 10:33:16     INFO -      m._handle_socket_failure()
[task 2017-07-09T10:33:16.205368Z] 10:33:16     INFO -    File "/home/worker/workspace/build/venv/local/lib/python2.7/site-packages/marionette_driver/decorators.py", line 23, in _
[task 2017-07-09T10:33:16.207145Z] 10:33:16     INFO -      return func(*args, **kwargs)
[task 2017-07-09T10:33:16.208370Z] 10:33:16     INFO -    File "/home/worker/workspace/build/venv/local/lib/python2.7/site-packages/marionette_driver/marionette.py", line 1208, in start_session
[task 2017-07-09T10:33:16.210069Z] 10:33:16     INFO -      self.protocol, _ = self.client.connect()
[task 2017-07-09T10:33:16.211347Z] 10:33:16     INFO -    File "/home/worker/workspace/build/venv/local/lib/python2.7/site-packages/marionette_driver/transport.py", line 223, in connect
[task 2017-07-09T10:33:16.212672Z] 10:33:16     INFO -      self.sock.connect((self.addr, self.port))
[task 2017-07-09T10:33:16.213892Z] 10:33:16     INFO -    File "/usr/lib/python2.7/socket.py", line 228, in meth
[task 2017-07-09T10:33:16.215644Z] 10:33:16     INFO -      return getattr(self._sock,name)(*args)
[task 2017-07-09T10:33:16.216847Z] 10:33:16     INFO -  error: [Errno 111] Connection refused
[task 2017-07-09T10:33:16.218212Z] 10:33:16    ERROR - Automation Error: Received unexpected exception while running application

The Marionette component didn't get started, not even after 3 minutes! As such the client times out. This indicates that there is clearly a problem in Firefox, whether with initializing components or with a hang/crash during startup.
Whiteboard: [stockwell needswork] → [stockwell unknown]
Depends on: 1382162
Depends on: 1385281
Whiteboard: [stockwell unknown] → [stockwell needswork]
It appears this bug has become a catch-all... there are totally unrelated classifications against this bug.
See Also: → 1385822
See Also: → 1385847
It was my impression that mochitests would soon move to installing
addons in a different way and not use Marionette?  Has this plan
changed, or am I mistaken?

From looking at a recent log:

> 16:35:06     INFO - GECKO(1097) | 1502728506714	Marionette	INFO	Enabled via --marionette

[…]

> 16:35:12     INFO -  error: [Errno 111] Connection refused
> 16:35:12    ERROR - Automation Error: Received unexpected exception while running application

In between here, we would have expected to see “Marionette INFO
Listening on port 2828”, indicating that the TCP listener in the
remote protocol has started up.

We use Services.tm.idleDispatchToMainThread to delay starting
up Marionette until Firefox has finished whatever it is
doing on the main thread to avoid potential synchronisation
problems.  You can see the relevant code for this in
https://searchfox.org/mozilla-central/rev/b258e6864ee3e809d40982bc5d0d5aff66a20780/testing/marionette/components/marionette.js#292-311.

I suspect that this check is not relevant to mochitest and that
mochitest may be instrumenting the content shell in such a way that
it never reaches an idle main thread. jmaher, does this sound like a
plausible theory?
Flags: needinfo?(jmaher)
Andreas, please see bug 1382162, which might be the underlying issue here too. It is marked as a dependency.
:ato, the work to not use Marionette was for Talos only, not mochitest/reftest; that work has been done, and mozprofile installs the webextensions. Talos is different from mochitest/reftest in that it has signed addons/webextensions and we can have persistent installations; if we were to sign our addons for the unittests, then we could install them in a different way.
Flags: needinfo?(jmaher)
:automatedtester, can you find someone to look at fixing this? We have many failures per week showing up from this, and Marionette connections have been a persistent problem since we started using Marionette.
Flags: needinfo?(dburns)
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #173)
> :automatedtester, can you find someone to look at fixing this? We have many
> failures per week showing up from this, and Marionette connections have
> been a persistent problem since we started using Marionette.

Once again, all those failures should be related to bug 1382162, which needs to be investigated and fixed first. Geoff has spent the most time on this so far but has had a very hard time reproducing it. I would suggest we move any discussion over to bug 1382162.
I agree, but we need someone to own resolving this issue as that bug has been idle for 10 days.
Comment on attachment 8902637 [details]
Bug 1261598 - Set port timeout to be more than 60s for ASAN.

https://reviewboard.mozilla.org/r/174308/#review179476

::: testing/mozharness/configs/unittests/linux_unittest.py:119
(Diff revision 1)
>                  "--log-errorsummary=%(error_summary_file)s",
>                  "--use-test-media-devices",
>                  "--screenshot-on-fail",
>                  "--cleanup-crashes",
>                  "--marionette-startup-timeout=180",
> +                "--marionette-port-timeout=180"

The port timeout is by default 360s. Why should we reduce it here, so that its value is even clashing with the --marionette-startup-timeout?

I don't see that this will change anything. It only will abort tests earlier.
(In reply to Henrik Skupin (:whimboo) from comment #177)
> Comment on attachment 8902637 [details]
> Bug 1261598 - Set port timeout to be more than 60s for ASAN.
> 
> https://reviewboard.mozilla.org/r/174308/#review179476
> 
> ::: testing/mozharness/configs/unittests/linux_unittest.py:119
> (Diff revision 1)
> >                  "--log-errorsummary=%(error_summary_file)s",
> >                  "--use-test-media-devices",
> >                  "--screenshot-on-fail",
> >                  "--cleanup-crashes",
> >                  "--marionette-startup-timeout=180",
> > +                "--marionette-port-timeout=180"
> 
> The port timeout is by default 360s. Why should we reduce it here, so that
> its value is even clashing with the --marionette-startup-timeout?
> 
> I don't see that this will change anything. It only will abort tests earlier.

The tests are timing out after 60s, so it's clearly not waiting 360s.
Flags: needinfo?(dburns)
(In reply to David Burns :automatedtester from comment #178)
> > ::: testing/mozharness/configs/unittests/linux_unittest.py:119
> > (Diff revision 1)
> > >                  "--log-errorsummary=%(error_summary_file)s",
> > >                  "--use-test-media-devices",
> > >                  "--screenshot-on-fail",
> > >                  "--cleanup-crashes",
> > >                  "--marionette-startup-timeout=180",
> > > +                "--marionette-port-timeout=180"
> > 
> > I don't see that this will change anything. It only will abort tests earlier.
> 
> The tests are timing out after 60s so its clearly not waiting 360s

No, tests are timing out after 3 min (180s), which is the startup timeout as listed above:

> [task 2017-08-26T08:04:23.069762Z] 08:04:23     INFO - Application command: /home/worker/workspace/build/application/firefox/firefox -marionette -foreground -profile /tmp/tmpGylzXD.mozrunner
[..]
[task 2017-08-26T08:04:27.051735Z] 08:04:27     INFO - GECKO(1205) | 1503734667044	Marionette	INFO	Enabled via --marionette
[task 2017-08-26T08:07:29.551916Z] 08:07:29     INFO -  Traceback (most recent call last):

To get more details about this failure we could temporarily do the following:

1) Change Marionette's log lines for received observer notifications from DEBUG to INFO level
2) Update test harnesses which use Marionette to use the DEBUG log level instead of INFO

I would actually prefer option 2), which could simply be done via a preference, as already mentioned in comment 150.

But basically it's still bug 1382162 we have to work on, because that is the underlying issue here.
Depends on: 1397734
Please note that some failures like the following are not caused by Marionette; instead we hit a MOZ_CRASH, which causes a shutdown of Firefox.

https://treeherder.mozilla.org/logviewer.html#?repo=autoland&job_id=127986873&lineNumber=22362

[task 2017-09-02T11:53:03.283917Z] 11:53:03     INFO - GECKO(2785) | Hit MOZ_CRASH(Aborting on channel error.) at /builds/worker/workspace/build/src/ipc/glue/MessageChannel.cpp:2535
[task 2017-09-02T11:53:03.290153Z] 11:53:03     INFO - GECKO(2785) | #01: mozilla::ipc::ProcessLink::OnChannelError [ipc/glue/MessageLink.cpp:393]
[task 2017-09-02T11:53:03.290404Z] 11:53:03     INFO - 
[task 2017-09-02T11:53:03.293195Z] 11:53:03     INFO - GECKO(2785) | #02: event_process_active_single_queue [ipc/chromium/src/third_party/libevent/event.c:1648]

Shouldn't we also add MOZ_CRASH to the error list, so that those cases are getting reported correctly on Treeherder? Otherwise we will get falsely starred test failures like in the above case. Ed, can you give feedback?
Flags: needinfo?(emorley)
So what I think also goes wrong here is how other test harnesses use Marionette. I already reported that in bug 1393366 for wpt tests. What's basically happening here is the following:

By default the startup_timeout of the Marionette driver is 120s, BUT as we can see here, it errors out after only 60s. The 120s is only used when you call `wait_for_port` or `raise_for_port` before the first call to `start_session`. Starting a session defaults to 60s and uses this value, inappropriately, to wait for the server to be ready; instead it would have to use the 120s:

http://searchfox.org/mozilla-central/rev/44c693914255638d74bcf1ec3b0bcd448dfab2fd/testing/marionette/client/marionette_driver/marionette.py#1173-1202

I will take care of that in my patch on bug 1397675. This should definitely give us 60s more time, which should be sufficient.
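To illustrate the distinction, a hedged sketch against the marionette_driver client described above (host, port, and the exact call pattern are assumptions on my part, not harness code):

# Hedged sketch: wait for the Marionette port using the longer
# startup_timeout before starting a session, instead of relying on
# start_session()'s shorter default wait.
from marionette_driver.marionette import Marionette

client = Marionette(host="127.0.0.1", port=2828, startup_timeout=120)
client.raise_for_port()   # blocks up to startup_timeout for the server socket
client.start_session()    # port is already open, so this connects quickly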
Depends on: 1397675
(In reply to Henrik Skupin (:whimboo) from comment #181)
> Shouldn't we also add MOZ_CRASH to the error list, so that those cases are
> getting reported correctly on Treeherder? Otherwise we will get falsely
> starred test failures like in the above case. Ed, can you give feedback?

I agree this sounds useful to have marked as an error (presuming it would result in no false positives). However, we're trying to move away from hardcoding error strings in Treeherder, so I would prefer this to be addressed by making the mozharness log level prefix ERROR rather than the current INFO, e.g.:

[task 2017-09-02T11:53:03.283917Z] 11:53:03     ERROR - GECKO(2785) | Hit MOZ_CRASH(...
Flags: needinfo?(emorley)
(In reply to Ed Morley [:emorley] from comment #185)
> hardcoding error strings in Treeherder, so I would prefer this to be
> addressed by making the mozharness log level prefix be ERROR rather than the

Sure, I filed bug 1399993.
The patch on bug 1397675 landed yesterday. I will observe the results on Monday to see if it also fixes this problem.
The number of failures dropped a lot after bug 1397675 got fixed, but there are still issues, and I think this is because of bug 1382162 (missing session restore notification).
I don't see many failures in the last day; in fact it has been quiet for almost 16 hours now. Let's see if this trend continues through the end of the week :)
Well, there are still quite a lot of failures. And with each job now waiting 3 minutes for Firefox and Marionette to be ready, nothing seems to have changed. This clearly indicates that the build is stuck during startup, and Marionette doesn't start because of it. Note that even the command line arguments handler is not getting called!

Most of the test failures are for mochitest-chrome on Windows 10, and similar to bug 1352671 they only happen for jobs run on Buildbot. Given that Joel moved those to TaskCluster via bug 1401184, let's see if this changes something.

If not, I'm still working on improving the logging of the Marionette driver.
Depends on: 1401184
The remaining failures mainly happen for browser-chrome tests, which Joel is trying to move to TC now. Maybe that helps us.
Yesterday I can only see failures for browser-chrome jobs which run via Buildbot; there is not a single entry for those running via TC now. So moving them from hardware to VMs seems to have helped a lot, and that was done via bug 1402068.

The top remaining failing jobs seem to be mochitest-gl-e10s. Joel, is there anything we can do to also port them to VMs?
Depends on: 1402068
Flags: needinfo?(jmaher)
For webgl/gpu/reftest we need a working GPU instance with a supported video card; right now there are issues with that, which is why these are running on hardware. That work is tracked in bug 1366288.
Flags: needinfo?(jmaher)
There is a chance this could still come up on the radar for disable-recommended, but we are under the 200 threshold and I suspect we will remain there.
Whiteboard: [stockwell disable-recommended] → [stockwell needswork]
sorry, I was looking at bug 1391545 and accidentally changed this bug.
Whiteboard: [stockwell needswork] → [stockwell disable-recommended]
(In reply to OrangeFactor Robot from comment #202)
> 40 failures in 178 pushes (0.225 failures/push) were associated with this
> bug yesterday.    

Only 5 of those failures are from test jobs run on TC workers; all the rest come from Buildbot workers. The mochitest-gl-e10s-* jobs seem to be the most affected.
Depends on: 1403484
Depends on: 1403616
There have been no more failures since September 28th, which is when my patch on bug 1403616 landed.
Assignee: nobody → hskupin
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla58
Whiteboard: [stockwell disable-recommended] → [stockwell fixed:other]
Product: Testing → Remote Protocol
Moving bug to Testing::Marionette Client and Harness component per bug 1815831.
Component: Marionette → Marionette Client and Harness
Product: Remote Protocol → Testing