Closed Bug 1434618 Opened 6 years ago Closed 6 years ago

Worker shutdown should start at xpcom-will-shutdown notification

Categories

(Core :: DOM: Workers, defect, P2)

Version: 58 Branch
Type: defect

Tracking


RESOLVED INVALID
Tracking Status
firefox-esr52 --- unaffected
firefox-esr60 --- wontfix
firefox59 + wontfix
firefox60 + wontfix
firefox61 - wontfix
firefox62 --- wontfix
firefox63 --- wontfix

People

(Reporter: baku, Assigned: baku)

References

(Blocks 3 open bugs)

Details

Blocks: 1405290
Pushed by amarchesini@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/f31755ef4145
Worker shutdown should start at xpcom-will-shutdown notification, r=smaug
If this ends up helping out with shutdown hangs, we might want to consider uplift to 59.
Backed out for webdriver failures, e.g. webdriver/tests/actions/mouse_dblclick.py:

https://hg.mozilla.org/integration/mozilla-inbound/rev/0aa3298451dc7d725b1d98deb9461ae190cd635d

Push with failures: https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=cf81d20f671314bd737e8db7ef9ed97f48f183b1&filter-resultStatus=testfailed&filter-resultStatus=busted&filter-resultStatus=exception&filter-resultStatus=retry&filter-resultStatus=usercancel&filter-resultStatus=runnable
Failure log: https://treeherder.mozilla.org/logviewer.html#?job_id=159591549&repo=mozilla-inbound

[task 2018-01-31T16:03:59.497Z] 16:03:59     INFO - STDOUT: tests/web-platform/tests/webdriver/tests/actions/mouse_dblclick.py::test_no_dblclick 
[task 2018-01-31T16:03:59.498Z] 16:03:59     INFO - PID 842 | 1517414639495	mozrunner::runner	INFO	Running command: "/builds/worker/workspace/build/application/firefox/firefox" "-marionette" "-profile" "/tmp/rust_mozprofile.bcqhtUmGL4p0"
[task 2018-01-31T16:03:59.575Z] 16:03:59     INFO - TEST-UNEXPECTED-TIMEOUT | /webdriver/tests/actions/mouse_dblclick.py | expected OK
Flags: needinfo?(amarchesini)
Severity: normal → major
Priority: -- → P2
Flags: needinfo?(amarchesini)
Pushed by amarchesini@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/22fe0145f9ed
Worker shutdown should start at xpcom-will-shutdown notification, r=smaug
Backout by shindli@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/1ed7050a69b4
Backed out changeset 22fe0145f9ed for Wd failures in /webdriver/tests/actions/mouse_dblclick.py on a CLOSED TREE
It seems that this patch is amplifying an existing problem; bug 1414144 is one example.
Tomorrow I'll take a closer look. whimboo, if you have any idea, let me know.
Flags: needinfo?(hskupin)
We have a couple of known shutdown hang scenarios, so I wouldn't be surprised if one of those is the cause. Or do those tests fail only with a crash?

Btw. the restarts in-between the tests in this test module aren't necessary anymore. A patch on bug 1432105 could help a lot here by reducing the runtime of this test to maybe 15s at most. Maja, do you think you can have a look at this? It would help a lot, so that Andrea can continue landing patches for slow shutdowns.
Depends on: 1432105
Flags: needinfo?(hskupin) → needinfo?(mjzffr)
Yes, I'll probably have time for bug 1432105 later today.
Flags: needinfo?(mjzffr)
Likely too late for 59 at this point with release 2 weeks away, but we can still consider taking patches for 60.
:baku, update?
Flags: needinfo?(amarchesini)
(In reply to Marion Daly [:mdaly] from comment #10)
> :baku, update?

Landing this patch is blocked by bug 1432105, which itself was blocked by a couple of other bugs. I worked hard to get all the dependencies fixed, and by today we should be able to get the patch on bug 1432105 landed too. So later today or tomorrow we can try to land this patch again, or, better, start with a try build.
Andrea, would you mind rebasing your patch and pushing it to try? We won't be able to land bug 1432105 today due to some other issues, but I would like to know how this patch behaves now for the wdspec tests. Given that I enabled trace logs for Marionette yesterday, we should get much better failure details. Thanks!
Andrea pushed a try build for which the linux32 wdspec tests are still failing:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=32e799340b1bf34d6f1d9c2932d30fbbe7318f0a&filter-searchStr=linux%20wd&selectedJob=166721977

The problem seems to be that moving to `xpcom-will-shutdown` causes even more / longer hangs. You can check the log which prints all the details now:

https://treeherder.mozilla.org/logviewer.html#?job_id=166721977&repo=try&lineNumber=5535

See all the wait for browser shutdown lines above.

Maybe this is all related to the tabmodal dialog tests in each of those files, and to the hangs Andreas reported on bug 1443524.
Andreas, any comment about this?
Flags: needinfo?(amarchesini) → needinfo?(ato)
mouse_dblclick.py doesn’t, as far as I can tell from just
inspecting the source code, cause any tabmodal dialogues to
appear.  But I suppose it is always possible that one could be
left over from an earlier test.

The best way to debug this would be to run it locally and inspect
what’s happening:

> % ./mach wpt testing/web-platform/tests/webdriver/tests/actions/mouse_dblclick.py --timeout-multiplier=100 --webdriver-arg=-vv

My experience has been that the shutdown hang is fairly easily
reproducible.
Flags: needinfo?(ato)
Andrea, please rebase your patch on mozilla-central and try again. Thanks.
Flags: needinfo?(amarchesini)
(In reply to Andrea Marchesini [:baku] from comment #17)
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=90ab5c9fd2d7bf02d3f38b74bb553aca65bb4da8
> 
> still unhappy results.

Would you want to hand this off to someone else?
So far it's just about pushing to try. Not a big deal. 

whimboo, do you have time to see why the patch still breaks the WD tests?
Flags: needinfo?(hskupin)
I see some strange behavior here. Everything below is based on the following log:

https://treeherder.mozilla.org/logviewer.html#?job_id=169826750&repo=try (log viewer)
https://taskcluster-artifacts.net/W8In7kJvTES01pc-37mDGA/0/public/logs/live_backing.log (full log)


1) Geckodriver reports after only 5s that it failed to stop the browser process. How can this be, given that we should wait 70s for that to happen? Andreas, do you have an idea? Here is the excerpt from the log:

> [task 2018-03-23T09:02:46.253Z] 09:02:46     INFO - PID 8530 | 1521795766245	Marionette	TRACE	0 -> [0,3,"quit",{"flags":["eForceQuit"]}]
[..]
> [task 2018-03-23T09:02:51.563Z] 09:02:51     INFO - PID 8530 | 1521795771555	geckodriver::marionette	ERROR	Failed to stop browser process

2) There are failures during shutdown, and I don't know whether they could affect the shutdown time and thus the issue mentioned in 1). Maybe they are unrelated, because I also see 1) without those failures:

[task 2018-03-23T09:02:53.945Z] 09:02:53     INFO - PID 8530 | ###!!! [Parent][DispatchAsyncMessage] Error: PClientManager::Msg_Teardown Route error: message sent to unknown actor ID
[task 2018-03-23T09:02:53.945Z] 09:02:53     INFO - PID 8530 | 
[task 2018-03-23T09:02:53.946Z] 09:02:53     INFO - PID 8530 | JavaScript error: resource://gre/modules/osfile/osfile_async_front.jsm, line 408: Error: OS.File has been shut down. Rejecting post to DirectoryIterator_prototype_close
[task 2018-03-23T09:02:53.947Z] 09:02:53     INFO - PID 8530 | JavaScript error: resource://gre/modules/osfile/osfile_async_front.jsm, line 408: Error: OS.File has been shut down. Rejecting post to DirectoryIterator_prototype_close
[task 2018-03-23T09:02:53.948Z] 09:02:53     INFO - PID 8530 | JavaScript error: resource://gre/modules/ProfileAge.jsm, line 174: Error: Unable to fetch oldest profile entry: Error: OS.File has been shut down. Rejecting post to DirectoryIterator_prototype_next
[task 2018-03-23T09:02:53.948Z] 09:02:53     INFO - PID 8530 | JavaScript error: resource://gre/modules/ProfileAge.jsm, line 174: Error: Unable to fetch oldest profile entry: Error: OS.File has been shut down. Rejecting post to DirectoryIterator_prototype_next
Flags: needinfo?(hskupin) → needinfo?(ato)
(In reply to Henrik Skupin (:whimboo) from comment #20)

> 1) Geckodriver reports after only 5s that it failed to stop the
> browser process. How can this be, given that we should wait 70s
> for that to happen? Andreas, do you have an idea? Here is the
> excerpt from the log:
> 
> > [task 2018-03-23T09:02:46.253Z] 09:02:46     INFO - PID 8530 | 1521795766245	Marionette	TRACE	0 -> [0,3,"quit",{"flags":["eForceQuit"]}]
> [..]
> > [task 2018-03-23T09:02:51.563Z] 09:02:51     INFO - PID 8530 | 1521795771555	geckodriver::marionette	ERROR	Failed to stop browser process

I think std::process::Child::kill() errors if the process isn’t
running:

    https://doc.rust-lang.org/std/process/struct.Child.html#method.kill

mozrunner unconditionally calls this after
mozrunner::RunnerProcess::running() is false in the ::wait() loop:

    https://searchfox.org/mozilla-central/rev/003262ae12ce937950ffb8d3b0fa520d1cc38bff/testing/mozbase/rust/mozrunner/src/runner.rs#147
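To make that failure mode concrete, here is a minimal standalone sketch (my own illustration, not mozrunner code) showing that kill() on a child that has already been reaped returns an error; it assumes a Unix-like system where `true` exits immediately:

    use std::process::Command;
    use std::{thread, time::Duration};

    fn main() -> std::io::Result<()> {
        // Spawn a short-lived child process; `true` exits immediately.
        let mut child = Command::new("true").spawn()?;

        // Let it exit on its own, then reap it with wait().
        thread::sleep(Duration::from_millis(100));
        println!("child exited on its own: {}", child.wait()?);

        // kill() on a child that has already been reaped returns an
        // ErrorKind::InvalidInput error instead of succeeding.
        if let Err(e) = child.kill() {
            println!("kill failed: {}", e);
        }
        Ok(())
    }

If mozrunner has already observed the exit before the unconditional kill(), that would explain a "Failed to stop browser process" line even though the browser actually shut down on its own.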
Flags: needinfo?(ato)
I filed bug 1448900 to get the geckodriver issue investigated.
Investigation has shown that the problem is just confusing output in the log, and that the browser process exited on its own. So it's nothing we have to worry about; the actual problem must be something else...

A couple of tests require a new session of Firefox, and as such the browser gets restarted. It looks like there is always a 5s gap on debug builds in-between those two lines:

[task 2018-03-23T08:44:32.763Z] 08:44:32     INFO - PID 5373 | JavaScript error: resource://gre/modules/ProfileAge.jsm, line 174: Error: Unable to fetch oldest profile entry: Error: OS.File has been shut down. Rejecting request to DirectoryIterator_prototype_next
[task 2018-03-23T08:44:37.917Z] 08:44:37     INFO - PID 5373 | [Parent 5514, Main Thread] WARNING: NS_ENSURE_TRUE(maybeContext) failed: file /builds/worker/workspace/build/src/xpcom/threads/nsThread.cpp, line 809

That increases the shutdown time from 5s to 10s for every shutdown and results in a test duration of 365 seconds:

> [task 2018-03-23T08:44:14.042Z] 08:44:14     INFO - TEST-START | /webdriver/tests/sessions/new_session/create_alwaysMatch.py
[..]
> [task 2018-03-23T08:50:19.044Z] 08:50:19     INFO - TEST-UNEXPECTED-TIMEOUT | /webdriver/tests/sessions/new_session/create_alwaysMatch.py | expected OK
> [task 2018-03-23T08:50:19.044Z] 08:50:19     INFO - TEST-INFO took 365002ms

I do not see that hang for regular CI builds on Treeherder. For those we have the following timing:

>[task 2018-03-25T23:09:34.731Z] 23:09:34     INFO - TEST-START | /webdriver/tests/sessions/new_session/create_alwaysMatch.py
[..]
> [task 2018-03-25T23:13:44.860Z] 23:13:44     INFO - TEST-OK | /webdriver/tests/sessions/new_session/create_alwaysMatch.py | took 250110ms

That means that with your patch the affected tests need roughly 46% more time to complete (365s vs. 250s). I would also say that every test which causes a restart of Firefox is affected, but only those 3 fail permanently because they have a higher number of subtests to run.

Andrea, I hope that helps with further debugging of the shutdown delays.
Flags: needinfo?(amarchesini)
Blocks: 1437575
Blocks: 1414144
No longer blocks: 1414144
Blocks: 988872
See Also: → 705178
Probably wontfix for 60 at this point.
:baku, update?
No longer blocks: 1449538
Andrea, did you have a chance to test again on try? The wdspec timeouts are all gone now.
I checked a try push from Andrea (made with the patch on bug 1460239) and noticed something interesting:

try:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=b20ebbb2bc64c73e1d7f606db8e99ada7a367775&filter-searchStr=wd

autoland:
https://treeherder.mozilla.org/#/jobs?repo=autoland&revision=07d5c09b696accd800da11b297cd1f57c6d0f761&filter-searchStr=wd

Take the test create_alwaysMatch.py as an example. Here I can see a 5s delay with the build from the try push every time the browser gets shut down:

https://treeherder.mozilla.org/logviewer.html#?job_id=177700590&repo=try&lineNumber=62190-62193

> [task 2018-05-09T16:43:53.317Z] 16:43:53     INFO - PID 8331 | 1525884233312	Marionette	DEBUG	Received observer notification xpcom-will-shutdown
> [task 2018-05-09T16:43:53.318Z] 16:43:53     INFO - PID 8331 | 1525884233312	Marionette	DEBUG	Resetting recommended pref toolkit.cosmeticAnimations.enabled
> [task 2018-05-09T16:43:53.319Z] 16:43:53     INFO - PID 8331 | 1525884233313	Marionette	DEBUG	Resetting recommended pref datareporting.policy.dataSubmissionPolicyAccepted
> [task 2018-05-09T16:43:58.415Z] 16:43:58     INFO - PID 8331 | 1525884238411	geckodriver::marionette	DEBUG	Browser process stopped: exit code: 0

Marionette receives the xpcom-will-shutdown notification and resets the prefs. Then it takes 5s until the browser exits.

Now compare it with browser shutdowns on autoland:

https://treeherder.mozilla.org/logviewer.html#?job_id=177714286&repo=autoland&lineNumber=64025-64028

> [task 2018-05-09T18:09:20.611Z] 18:09:20     INFO - PID 8253 | 1525889360603	Marionette	DEBUG	Received observer notification xpcom-will-shutdown
> [task 2018-05-09T18:09:20.612Z] 18:09:20     INFO - PID 8253 | 1525889360604	Marionette	DEBUG	Resetting recommended pref toolkit.cosmeticAnimations.enabled
> [task 2018-05-09T18:09:20.614Z] 18:09:20     INFO - PID 8253 | 1525889360604	Marionette	DEBUG	Resetting recommended pref datareporting.policy.dataSubmissionPolicyAccepted
> [task 2018-05-09T18:09:20.877Z] 18:09:20     INFO - PID 8253 | 1525889360872	geckodriver::marionette	DEBUG	Browser process stopped: exit code: 0

Firefox quits immediately without the 5s delay.

So moving the worker shutdown to xpcom-will-shutdown clearly produces another unwanted shutdown delay.
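For reference, those figures are just the differences between the epoch-millisecond timestamps in the quoted log lines; a throwaway sketch with the values copied from the two logs above (not part of any Mozilla tooling):

    fn main() {
        // Epoch-millisecond timestamps copied from the quoted log excerpts.
        let try_last_pref_reset: u64 = 1_525_884_233_313;
        let try_process_stopped: u64 = 1_525_884_238_411;
        let autoland_last_pref_reset: u64 = 1_525_889_360_604;
        let autoland_process_stopped: u64 = 1_525_889_360_872;

        // Gap between the last pref reset and "Browser process stopped".
        println!("try:      {:.3}s", (try_process_stopped - try_last_pref_reset) as f64 / 1000.0);           // ~5.098s
        println!("autoland: {:.3}s", (autoland_process_stopped - autoland_last_pref_reset) as f64 / 1000.0); // ~0.268s
    }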
No longer depends on: 1460239
I'm still very interested in seeing this land in time for uplift to 61, but it's not a new issue either and I don't think tracking it for each release on an ongoing basis adds much value to anything.
(In reply to Ryan VanderMeulen [:RyanVM] from comment #30)
> ... I don't think tracking it for each release
> on an ongoing basis adds much value to anything.

This wouldn't be a problem if you hadn't switched to a rapid release schedule for new Firefox versions.
:baku, is this still on your radar?
Flags: needinfo?(amarchesini)
Yes, but it seems this is not needed anymore. The worker shutdown seems to be in good shape at the moment. Am I right?
Initially I wanted this in order to separate the worker shutdown from any other component shutdown, but we can probably mark this as invalid.
Flags: needinfo?(amarchesini) → needinfo?(mdaly)
Yeah, I think you're right.
Flags: needinfo?(mdaly)
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → INVALID