Closed Bug 1363378 Opened 7 years ago Closed 7 years ago

[e10s] Tab processes are not killed after crashing Firefox

Categories

(Core :: Security: Process Sandboxing, defect, P1)

x86_64
Linux
defect

Tracking

()

VERIFIED FIXED
mozilla57
Tracking Status
firefox-esr52 --- unaffected
firefox53 --- unaffected
firefox54 --- wontfix
firefox55 + wontfix
firefox56 + wontfix
firefox57 + verified
firefox58 --- verified

People

(Reporter: emilghitta, Assigned: jld)

References

Details

(Keywords: regression, Whiteboard: sb+)

Attachments

(2 files)

[Affected versions]:
Firefox 55.0a1 (Build Id:20170508100218)
Firefox 54.0b6 (Build Id:20170508210007)

[Unaffected versions]
Firefox 53.0.2 (Build Id:20170504175505)
Firefox 52.1.1 esr (Build Id:20170504112025)

[Affected platforms]:
Ubuntu 16.04 LTS 64bit

[Steps to reproduce]:
1.Launch Firefox with a clean profile.
2.Enable e10s (if it is not already enabled).
3.Got to https://www.youtube.com webpage.
4.Open 2 random videos.
5.Install Crash Me Now add-on.
6.Crash the browser.


[Expected result]:
After crashing Firefox, the audio stops playback and the opened processes are killed.

[Actual result]:
After crashing Firefox, the audio continues to playback and the opened processes are not killed. The audio can be heard even if starting Firefox again after submitting the crash report.

[Regression range]:

I think that this issue requires finding the regression manually.
I will get back to this as soon as time permits.

[Additional notes]
Please note that this issue occurs only when e10s is enabled.

Please note that the Web Content process is not killed after Firefox crashed.

For more information regarding this issue please observe the following video: https://drive.google.com/open?id=0B94iuIVMr-TER21BbTlncURPRGM
Depends on: e10s
No longer depends on: e10s
Blocks: e10s
This issue is reproducible with 56.0a1 (Build Id:20170712100301) using Ubuntu 16.04 64 bit by following the up mentioned steps.
Severity: normal → major
Mike, can you take look at these crashes or help find someone to investigate?
Flags: needinfo?(mconley)
Sounds like media stuff to me... perhaps jwwang knows what's going on?
Flags: needinfo?(mconley) → needinfo?(jwwang)
If it's easy to reproduce maybe a quick regression test would help?
Keywords: regression
As noted in comment 0, the Web Content process is not killed after Firefox crashed. The audio decoder still continues to decode and output sound. If audio is expected to stop immediately, the system should kill the content process also.
Flags: needinfo?(jwwang)
I did a manual regression and got to the following result:

Last-good buildid: 2016-10-06 (20161006030208)
First-bad buildid: 2016-10-07 (20161007030207)

https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=777fb63db8de8d78b64b8141d1a998397275dd03&tochange=4b9944879c9a60a9aba4a744a7401bc38e0f39c4

Possible culprit:
43bc30457ad4	JW Wang — Bug 1308066 - remove mLowDataThresholdUsecs. r=kikuo
47f0358bfff2	JW Wang — Bug 1308065 - move mBufferingWait into BufferingState. r=kikuo
Requesting needinfo from JW Wang since he is the one that fixed the possible identified culprit.
Flags: needinfo?(jwwang)
(In reply to Bogdan Maris, QA [:bogdan_maris] from comment #6)
> https://hg.mozilla.org/mozilla-central/
> pushloghtml?fromchange=777fb63db8de8d78b64b8141d1a998397275dd03&tochange=4b99
> 44879c9a60a9aba4a744a7401bc38e0f39c4

That list is pretty long. Please use mozregression to further nail down the range. Usually it gives you the exact changeset at the end, by also caring about different branches (eg. merges from autoland or inbound)
Flags: needinfo?(bogdan.maris)
(In reply to Henrik Skupin (:whimboo) from comment #8)
> (In reply to Bogdan Maris, QA [:bogdan_maris] from comment #6)
> > https://hg.mozilla.org/mozilla-central/
> > pushloghtml?fromchange=777fb63db8de8d78b64b8141d1a998397275dd03&tochange=4b99
> > 44879c9a60a9aba4a744a7401bc38e0f39c4
> 
> That list is pretty long. Please use mozregression to further nail down the
> range. Usually it gives you the exact changeset at the end, by also caring
> about different branches (eg. merges from autoland or inbound)

Due to the nature of the bug (it involves crash) I am unable to use mozregression to determine a more accurate range (this is why Emil commented in #0 that it would require manual regression). Also I tried to use tinderbox builds but I never reproduced on them, not even in more recent builds where the issue should have reproduced with no problems.
Flags: needinfo?(bogdan.maris)
JW, from comment 5, are you intending to work on this? Or do you think it should go back to the DOM/content folks, or to someone else on the Audio/Video team?   Are you able to reproduce it in Nightly?
Yes I can repro this issue in Nightly. However, I don't think this is a media bug as stated in comment 5.
Flags: needinfo?(jwwang)
(In reply to Bogdan Maris, QA [:bogdan_maris] from comment #6)
> I did a manual regression and got to the following result:
> 
> Last-good buildid: 2016-10-06 (20161006030208)
> First-bad buildid: 2016-10-07 (20161007030207)
> 
> https://hg.mozilla.org/mozilla-central/
> pushloghtml?fromchange=777fb63db8de8d78b64b8141d1a998397275dd03&tochange=4b99
> 44879c9a60a9aba4a744a7401bc38e0f39c4
> 
> Possible culprit:
> 43bc30457ad4	JW Wang — Bug 1308066 - remove mLowDataThresholdUsecs. r=kikuo
> 47f0358bfff2	JW Wang — Bug 1308065 - move mBufferingWait into
> BufferingState. r=kikuo

There is no way for these bugs to be the culprit.

Bug 1308066 changes the type of some members. No call flow is changed.
Bug 1308065 moves a member from class A to class B. No call flow is changed.
It's been a while since the report. Can you please re-check if this is still a problem?
Flags: needinfo?(emil.ghitta)
Hi Henrik!

Indeed, this issue is still reproducible. 

I managed to reproduce it on Firefox 57.0a1(Build Id:20170809100326)and it seems that this issue "rides the trains" since it is reproducible with Firefox 56.0b1(BuildId:20170808170225) now.

Thanks!
Flags: needinfo?(emil.ghitta)
And this happens for Windows only? Have you tried those steps on other platforms yet?
I managed to reproduce this issue only on Linux (as stated in comment 0) having e10 enabled.

This issue is not reproducible on Windows and macOs.
Emil, it might be good to get a process sample (dump) from one or multiple of those still runnning webcontent processes.
Flags: needinfo?(emil.ghitta)
Hi Henrik!

I generated 2 process dumps (from 2 running webcontent processes):
https://drive.google.com/open?id=0BzOkHBmdnAoaVkxTOGNDNmRkeUE

Hope that this helps.
Flags: needinfo?(emil.ghitta)
(In reply to Emil Ghitta, QA [:emilghitta] from comment #18)
> I generated 2 process dumps (from 2 running webcontent processes):
> https://drive.google.com/open?id=0BzOkHBmdnAoaVkxTOGNDNmRkeUE

Each of those dumps are 1.4GB in size! Not sure if that is really helpful to get it analyzed. Karl, what would be a good way to get a process sample of content processes, which continue to run after Firefox has been closed on Linux?
Flags: needinfo?(karlt)
I guess kill -ABRT <content-process-id> fails to produce a crash report because there is no browser process to manage the submission?

If you can find a build with symbols, then gdb can produce a backtrace.
Perhaps ASAN builds have symbols?  I don't know.

But a backtrace may not be helpful.  As the process is still running apparently, it has moved on from when it should have responded.

To diagnose this, I'd find the code that should detect the pipe close, and enable logging to find out what happened after that.

Finding a regression range would be the simplest way to track down a likely cause.
I'd start with the builds under https://tools.taskcluster.net/index/gecko.v2.mozilla-central.pushdate.2016.10.06
Flags: needinfo?(karlt)
(In reply to Bogdan Maris, QA [:bogdan_maris] from comment #6)
> I did a manual regression and got to the following result:
> 
> Last-good buildid: 2016-10-06 (20161006030208)
> First-bad buildid: 2016-10-07 (20161007030207)

Ok, so I setup up my old Linux machine again, and installed both builds. I can verify this regression range at least now. 

What I observed is that when I close the next instance of Firefox as started by the crash manager all content processes from the first instance are also gone. It also helps just to wait about 30s. Then at least any played audio is also gone but the content processes remain running.

Before crashing the browser:

$ ps -ef | grep 6531
henrik    6531     1 35 11:20 ?        00:00:09 /home/henrik/code/gecko/firefox/firefox
henrik    6648  6531 66 11:20 ?        00:00:12 /home/henrik/code/gecko/firefox/plugin-container -greomni /home/henrik/code/gecko/firefox/omni.ja -appomni /home/henrik/code/gecko/firefox/browser/omni.ja -appdir /home/henrik/code/gecko/firefox/browser 6531 true tab

After crashing with crash reporter open:

henrik    6648     1 39 11:20 ?        00:00:28 /home/henrik/code/gecko/firefox/plugin-container -greomni /home/henrik/code/gecko/firefox/omni.ja -appomni /home/henrik/code/gecko/firefox/browser/omni.ja -appdir /home/henrik/code/gecko/firefox/browser 6531 true tab
henrik    6737     1  1 11:21 ?        00:00:00 /home/henrik/code/gecko/firefox/crashreporter /home/henrik/.mozilla/firefox/mwad0hks.default/minidumps/2bf8a428-f4b0-1fe5-733decef-578f4de8.dmp

After the restart via crash manager:

henrik    6648     1 27 11:20 ?        00:00:31 /home/henrik/code/gecko/firefox/plugin-container -greomni /home/henrik/code/gecko/firefox/omni.ja -appomni /home/henrik/code/gecko/firefox/browser/omni.ja -appdir /home/henrik/code/gecko/firefox/browser 6531 true tab
henrik    6744     1 25 11:22 ?        00:00:02 /home/henrik/code/gecko/firefox/firefox

> I guess kill -ABRT <content-process-id> fails to produce a crash report because there is no browser process to manage the submission?

Indeed, I tried that but without any success.

I narrowed down the range to the following merge changeset from autoland:

https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=da986c9f1f723af1e0c44f4ccd4cddd5fb6084e8&tochange=d8e1f5cf0a70a53e8a5532809096a0a5bf729196

I will continue with builds from autoland to get the exact causing bug.
Here the final regression range:

https://hg.mozilla.org/integration/autoland/pushloghtml?fromchange=3e9a2031152f&tochange=c838d2546cad

So it's related to the sandbox changes as made on bug 1289718. Gian-Carlo can you please have a look?
Blocks: 1289718
Flags: needinfo?(gpascutto)
Flags: needinfo?(gpascutto)
Whiteboard: sb?
Track 56+/57+ as regression.
Assignee: nobody → gpascutto
Component: DOM: Content Processes → Security: Process Sandboxing
Priority: -- → P1
Whiteboard: sb? → sb+
Can you reproduce this with MOZ_SANDBOX_LOGGING=1 set in the environement?  (For old builds before bug 1317735, which landed in 54.0a1, use MOZ_SANDBOX_VERBOSE=1 instead.)

My guess is that something is reacting badly to the error code we return when the broker hangs up the connection.  Specifically, I think we just pass along EPIPE from the socket I/O, but EIO would be more appropriate, and getting the caller to actually give up might need something else entirely (or maybe just unilaterally killing the process).

The logs should at least let us see what file is affected, which could help pin down what's getting stuck.
Flags: needinfo?(hskupin)
Attached file sandbox logging output
Jed, as attached you can find the output from sandbox logging. Around lines 2508 the crash occurred and I let the crash reporter restart Firefox. At the end I closed Firefox normally. I hope there is some useful data in here for you to investigate.
Flags: needinfo?(hskupin)
I can also prevent this behavior when changing the sandbox level back to 1.
So it looks as if it's the BrokerClient (in a crash reporting child?) that fails to connect to the Broker (in the parent), gets an EIO and ends up confused?
So this is weird:

Sandbox: rewriting /proc/self/exe -> /proc/4525/exe
Sandbox: Recording mapping /home/henrik/code/gecko/firefox/firefox -> /proc/4525/exe
Sandbox: SandboxBroker: denied op=open rflags=0 perms=0 path= for pid=4525
Sandbox: Failed errno -13 op 0 flags 00 path 
Sandbox: SandboxBroker: denied op=open rflags=0 perms=0 path= for pid=4525
Sandbox: Failed errno -13 op 0 flags 00 path 

It's trying to pass open() an empty string.  Normally that's ENOENT, but we're returning EACCES.  If something is treating ENOENT specially that might be it.  We could try to find out what's making that call by crashing in the broker client if the path is empty.

Also, there aren't too many uses of /proc/self/exe in our code — the profiler is one, and a copy of libbacktrace used as part of the Rust crate backtrace-sys (media has been using Rust for a while) is another.  Or it could be a system library.

(In reply to Gian-Carlo Pascutto [:gcp] from comment #28)
> So it looks as if it's the BrokerClient (in a crash reporting child?) that
> fails to connect to the Broker (in the parent), gets an EIO and ends up
> confused?

We're not returning EIO, I don't think; we just return the error from sendmsg, which is probably EPIPE if the parent has exited.  I was suggesting that EIO is probably what a "real" filesystem would do if it couldn't access its underlying storage, and EPIPE is really not an expected response to open()ing a regular file; it probably doesn't make a difference, but maybe it does.
Mark 55- and 56- as we are only 1 week away from RC.
I can reproduce this, and I'm reasonably sure I know what's happening: the file broker socket isn't close-on-exec, so the crashreporter process inherits a copy of it, and the child process blocks in the broker client waiting for a response it will never get.  In particular, it blocks in the Gecko IPC I/O thread (trying to open /tmp/GeckoChildCrash%d.extra), and that's probably the thread that would normally notice that the parent process went away.
Assignee: gpascutto → jld
Reproducing this on a debug build requires some environment variables: MOZ_CRASHREPORTER=t (debug builds disable the normal crash reporter by default), as well as the misleadingly named XRE_NO_WINDOWS_CRASH_DIALOG=t — without the latter, even with this bug fixed, when the I/O thread returns from the broker call and crashes due to the IPC error caused by the parent process's nonexistence, it will do the usual debug-build “sleeping for 300 seconds” “Type … to attach your debugger” thing (probably because Breakpad tries to request a minidump from the nonexistent parent process and fails, in which case it delegates to the previous signal handler instead of immediately exiting), and the phantom audio will continue for a while.
Comment on attachment 8907882 [details]
Bug 1363378 - Set close-on-exec in sandbox-related sockets held by parent process.

https://reviewboard.mozilla.org/r/179568/#review184848
Attachment #8907882 - Flags: review?(gpascutto) → review+
Pushed by gpascutto@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/b549b2ed2efc
Set close-on-exec in sandbox-related sockets held by parent process. r=gcp
https://hg.mozilla.org/mozilla-central/rev/b549b2ed2efc
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla57
I have verified that this issue is no longer reproducible using the "Crash Me Now (Simple)"add-on, on Firefox Nightly 58.0a1 (BuildId:20171002220204) with Ubuntu 16.04 64bit.

This issues is also verified fixed on Firefox 57.0b4 (BuildId:20170928180207) with Ubuntu 16.04 64bit using the following snippet: 

Cu.import("resource://gre/modules/ctypes.jsm");
let zero = new ctypes.intptr_t(8);
let badptr = ctypes.cast(zero, ctypes.PointerType(ctypes.int32_t));
badptr.contents;
Status: RESOLVED → VERIFIED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: