1363378 - [e10s] Tab processes are not killed after crashing Firefox

Reporter

Description

•

7 years ago

[Affected versions]:
Firefox 55.0a1 (Build Id:20170508100218)
Firefox 54.0b6 (Build Id:20170508210007)

[Unaffected versions]
Firefox 53.0.2 (Build Id:20170504175505)
Firefox 52.1.1 esr (Build Id:20170504112025)

[Affected platforms]:
Ubuntu 16.04 LTS 64bit

[Steps to reproduce]:
1.Launch Firefox with a clean profile.
2.Enable e10s (if it is not already enabled).
3.Got to https://www.youtube.com webpage.
4.Open 2 random videos.
5.Install Crash Me Now add-on.
6.Crash the browser.


[Expected result]:
After crashing Firefox, the audio stops playback and the opened processes are killed.

[Actual result]:
After crashing Firefox, the audio continues to playback and the opened processes are not killed. The audio can be heard even if starting Firefox again after submitting the crash report.

[Regression range]:

I think that this issue requires finding the regression manually.
I will get back to this as soon as time permits.

[Additional notes]
Please note that this issue occurs only when e10s is enabled.

Please note that the Web Content process is not killed after Firefox crashed.

For more information regarding this issue please observe the following video: https://drive.google.com/open?id=0B94iuIVMr-TER21BbTlncURPRGM

Emil Ghitta, QA [:emilghitta]

Reporter

Updated

•

7 years ago

Depends on: e10s

Emil Ghitta, QA [:emilghitta]

Reporter

Updated

•

7 years ago

No longer depends on: e10s

Emil Ghitta, QA [:emilghitta]

Reporter

Updated

•

7 years ago

Blocks: e10s

Emil Ghitta, QA [:emilghitta]

Reporter

Comment 1

•

7 years ago

This issue is reproducible with 56.0a1 (Build Id:20170712100301) using Ubuntu 16.04 64 bit by following the up mentioned steps.

Severity: normal → major

status-firefox56: --- → affected

Liz Henry (:lizzard) (relman/hg->git project)

Comment 2

•

7 years ago

Mike, can you take look at these crashes or help find someone to investigate?

Flags: needinfo?(mconley)

Mike Conley (:mconley) (:⚙️)

Comment 3

•

7 years ago

Sounds like media stuff to me... perhaps jwwang knows what's going on?

Flags: needinfo?(mconley) → needinfo?(jwwang)

Henrik Skupin [:whimboo][⌚️UTC+1]

Comment 4

•

7 years ago

If it's easy to reproduce maybe a quick regression test would help?

Keywords: regression

JW Wang [:jwwang] [:jw_wang]

Comment 5

•

7 years ago

As noted in comment 0, the Web Content process is not killed after Firefox crashed. The audio decoder still continues to decode and output sound. If audio is expected to stop immediately, the system should kill the content process also.

Flags: needinfo?(jwwang)

Bogdan Maris, Desktop QA

Comment 6

•

7 years ago

I did a manual regression and got to the following result:

Last-good buildid: 2016-10-06 (20161006030208)
First-bad buildid: 2016-10-07 (20161007030207)

https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=777fb63db8de8d78b64b8141d1a998397275dd03&tochange=4b9944879c9a60a9aba4a744a7401bc38e0f39c4

Possible culprit:
43bc30457ad4	JW Wang — Bug 1308066 - remove mLowDataThresholdUsecs. r=kikuo
47f0358bfff2	JW Wang — Bug 1308065 - move mBufferingWait into BufferingState. r=kikuo

Keywords: regressionwindow-wanted

Bogdan Maris, Desktop QA

Comment 7

•

7 years ago

Requesting needinfo from JW Wang since he is the one that fixed the possible identified culprit.

Flags: needinfo?(jwwang)

Henrik Skupin [:whimboo][⌚️UTC+1]

Comment 8

•

7 years ago

(In reply to Bogdan Maris, QA [:bogdan_maris] from comment #6)
> https://hg.mozilla.org/mozilla-central/
> pushloghtml?fromchange=777fb63db8de8d78b64b8141d1a998397275dd03&tochange=4b99
> 44879c9a60a9aba4a744a7401bc38e0f39c4

That list is pretty long. Please use mozregression to further nail down the range. Usually it gives you the exact changeset at the end, by also caring about different branches (eg. merges from autoland or inbound)

Flags: needinfo?(bogdan.maris)

Bogdan Maris, Desktop QA

Comment 9

•

7 years ago

(In reply to Henrik Skupin (:whimboo) from comment #8)
> (In reply to Bogdan Maris, QA [:bogdan_maris] from comment #6)
> > https://hg.mozilla.org/mozilla-central/
> > pushloghtml?fromchange=777fb63db8de8d78b64b8141d1a998397275dd03&tochange=4b99
> > 44879c9a60a9aba4a744a7401bc38e0f39c4
> 
> That list is pretty long. Please use mozregression to further nail down the
> range. Usually it gives you the exact changeset at the end, by also caring
> about different branches (eg. merges from autoland or inbound)

Due to the nature of the bug (it involves crash) I am unable to use mozregression to determine a more accurate range (this is why Emil commented in #0 that it would require manual regression). Also I tried to use tinderbox builds but I never reproduced on them, not even in more recent builds where the issue should have reproduced with no problems.

Flags: needinfo?(bogdan.maris)

Liz Henry (:lizzard) (relman/hg->git project)

Comment 10

•

7 years ago

JW, from comment 5, are you intending to work on this? Or do you think it should go back to the DOM/content folks, or to someone else on the Audio/Video team?   Are you able to reproduce it in Nightly?

status-firefox54: affected → wontfix

tracking-firefox55: --- → +

JW Wang [:jwwang] [:jw_wang]

Comment 11

•

7 years ago

Yes I can repro this issue in Nightly. However, I don't think this is a media bug as stated in comment 5.

Flags: needinfo?(jwwang)

JW Wang [:jwwang] [:jw_wang]

Comment 12

•

7 years ago

(In reply to Bogdan Maris, QA [:bogdan_maris] from comment #6)
> I did a manual regression and got to the following result:
> 
> Last-good buildid: 2016-10-06 (20161006030208)
> First-bad buildid: 2016-10-07 (20161007030207)
> 
> https://hg.mozilla.org/mozilla-central/
> pushloghtml?fromchange=777fb63db8de8d78b64b8141d1a998397275dd03&tochange=4b99
> 44879c9a60a9aba4a744a7401bc38e0f39c4
> 
> Possible culprit:
> 43bc30457ad4	JW Wang — Bug 1308066 - remove mLowDataThresholdUsecs. r=kikuo
> 47f0358bfff2	JW Wang — Bug 1308065 - move mBufferingWait into
> BufferingState. r=kikuo

There is no way for these bugs to be the culprit.

Bug 1308066 changes the type of some members. No call flow is changed.
Bug 1308065 moves a member from class A to class B. No call flow is changed.

Henrik Skupin [:whimboo][⌚️UTC+1]

Comment 13

•

7 years ago

It's been a while since the report. Can you please re-check if this is still a problem?

Flags: needinfo?(emil.ghitta)

Emil Ghitta, QA [:emilghitta]

Reporter

Comment 14

•

7 years ago

Hi Henrik!

Indeed, this issue is still reproducible. 

I managed to reproduce it on Firefox 57.0a1(Build Id:20170809100326)and it seems that this issue "rides the trains" since it is reproducible with Firefox 56.0b1(BuildId:20170808170225) now.

Thanks!

status-firefox57: --- → affected

Flags: needinfo?(emil.ghitta)

Henrik Skupin [:whimboo][⌚️UTC+1]

Comment 15

•

7 years ago

And this happens for Windows only? Have you tried those steps on other platforms yet?

Emil Ghitta, QA [:emilghitta]

Reporter

Comment 16

•

7 years ago

I managed to reproduce this issue only on Linux (as stated in comment 0) having e10 enabled.

This issue is not reproducible on Windows and macOs.

Henrik Skupin [:whimboo][⌚️UTC+1]

Comment 17

•

7 years ago

Emil, it might be good to get a process sample (dump) from one or multiple of those still runnning webcontent processes.

Flags: needinfo?(emil.ghitta)

Emil Ghitta, QA [:emilghitta]

Reporter

Comment 18

•

7 years ago

Hi Henrik!

I generated 2 process dumps (from 2 running webcontent processes):
https://drive.google.com/open?id=0BzOkHBmdnAoaVkxTOGNDNmRkeUE

Hope that this helps.

Flags: needinfo?(emil.ghitta)

Henrik Skupin [:whimboo][⌚️UTC+1]

Comment 20

•

7 years ago

(In reply to Emil Ghitta, QA [:emilghitta] from comment #18)
> I generated 2 process dumps (from 2 running webcontent processes):
> https://drive.google.com/open?id=0BzOkHBmdnAoaVkxTOGNDNmRkeUE

Each of those dumps are 1.4GB in size! Not sure if that is really helpful to get it analyzed. Karl, what would be a good way to get a process sample of content processes, which continue to run after Firefox has been closed on Linux?

Flags: needinfo?(karlt)

Karl Tomlinson (:karlt)

Comment 21

•

7 years ago

I guess kill -ABRT <content-process-id> fails to produce a crash report because there is no browser process to manage the submission?

If you can find a build with symbols, then gdb can produce a backtrace.
Perhaps ASAN builds have symbols?  I don't know.

But a backtrace may not be helpful.  As the process is still running apparently, it has moved on from when it should have responded.

To diagnose this, I'd find the code that should detect the pipe close, and enable logging to find out what happened after that.

Finding a regression range would be the simplest way to track down a likely cause.
I'd start with the builds under https://tools.taskcluster.net/index/gecko.v2.mozilla-central.pushdate.2016.10.06

Flags: needinfo?(karlt)

Henrik Skupin [:whimboo][⌚️UTC+1]

Comment 22

•

7 years ago

(In reply to Bogdan Maris, QA [:bogdan_maris] from comment #6)
> I did a manual regression and got to the following result:
> 
> Last-good buildid: 2016-10-06 (20161006030208)
> First-bad buildid: 2016-10-07 (20161007030207)

Ok, so I setup up my old Linux machine again, and installed both builds. I can verify this regression range at least now. 

What I observed is that when I close the next instance of Firefox as started by the crash manager all content processes from the first instance are also gone. It also helps just to wait about 30s. Then at least any played audio is also gone but the content processes remain running.

Before crashing the browser:

$ ps -ef | grep 6531
henrik    6531     1 35 11:20 ?        00:00:09 /home/henrik/code/gecko/firefox/firefox
henrik    6648  6531 66 11:20 ?        00:00:12 /home/henrik/code/gecko/firefox/plugin-container -greomni /home/henrik/code/gecko/firefox/omni.ja -appomni /home/henrik/code/gecko/firefox/browser/omni.ja -appdir /home/henrik/code/gecko/firefox/browser 6531 true tab

After crashing with crash reporter open:

henrik    6648     1 39 11:20 ?        00:00:28 /home/henrik/code/gecko/firefox/plugin-container -greomni /home/henrik/code/gecko/firefox/omni.ja -appomni /home/henrik/code/gecko/firefox/browser/omni.ja -appdir /home/henrik/code/gecko/firefox/browser 6531 true tab
henrik    6737     1  1 11:21 ?        00:00:00 /home/henrik/code/gecko/firefox/crashreporter /home/henrik/.mozilla/firefox/mwad0hks.default/minidumps/2bf8a428-f4b0-1fe5-733decef-578f4de8.dmp

After the restart via crash manager:

henrik    6648     1 27 11:20 ?        00:00:31 /home/henrik/code/gecko/firefox/plugin-container -greomni /home/henrik/code/gecko/firefox/omni.ja -appomni /home/henrik/code/gecko/firefox/browser/omni.ja -appdir /home/henrik/code/gecko/firefox/browser 6531 true tab
henrik    6744     1 25 11:22 ?        00:00:02 /home/henrik/code/gecko/firefox/firefox

> I guess kill -ABRT <content-process-id> fails to produce a crash report because there is no browser process to manage the submission?

Indeed, I tried that but without any success.

I narrowed down the range to the following merge changeset from autoland:

https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=da986c9f1f723af1e0c44f4ccd4cddd5fb6084e8&tochange=d8e1f5cf0a70a53e8a5532809096a0a5bf729196

I will continue with builds from autoland to get the exact causing bug.

Henrik Skupin [:whimboo][⌚️UTC+1]

Comment 23

•

7 years ago

Here the final regression range:

https://hg.mozilla.org/integration/autoland/pushloghtml?fromchange=3e9a2031152f&tochange=c838d2546cad

So it's related to the sandbox changes as made on bug 1289718. Gian-Carlo can you please have a look?

Blocks: 1289718

status-firefox-esr52: unaffected → ?

tracking-firefox56: --- → ?

tracking-firefox57: --- → ?

Flags: needinfo?(gpascutto)

Gian-Carlo Pascutto [:gcp]

Updated

•

7 years ago

Flags: needinfo?(gpascutto)

Whiteboard: sb?

Gerry Chang [:gchang]

Comment 24

•

7 years ago

Track 56+/57+ as regression.

tracking-firefox56: ? → +

tracking-firefox57: ? → +

Jim Mathies [:jimm]

Updated

•

7 years ago

Assignee: nobody → gpascutto

Component: DOM: Content Processes → Security: Process Sandboxing

Priority: -- → P1

Whiteboard: sb? → sb+

Jed Davis [:jld] ⟨⏰|UTC-7⟩ ⟦he/him⟧

Assignee

Comment 25

•

7 years ago

Can you reproduce this with MOZ_SANDBOX_LOGGING=1 set in the environement?  (For old builds before bug 1317735, which landed in 54.0a1, use MOZ_SANDBOX_VERBOSE=1 instead.)

My guess is that something is reacting badly to the error code we return when the broker hangs up the connection.  Specifically, I think we just pass along EPIPE from the socket I/O, but EIO would be more appropriate, and getting the caller to actually give up might need something else entirely (or maybe just unilaterally killing the process).

The logs should at least let us see what file is affected, which could help pin down what's getting stuck.

Flags: needinfo?(hskupin)

Henrik Skupin [:whimboo][⌚️UTC+1]

Comment 26

•

7 years ago

Attached file sandbox logging output — Details

Jed, as attached you can find the output from sandbox logging. Around lines 2508 the crash occurred and I let the crash reporter restart Firefox. At the end I closed Firefox normally. I hope there is some useful data in here for you to investigate.

Flags: needinfo?(hskupin)

Henrik Skupin [:whimboo][⌚️UTC+1]

Comment 27

•

7 years ago

I can also prevent this behavior when changing the sandbox level back to 1.

Gian-Carlo Pascutto [:gcp]

Comment 28

•

7 years ago

So it looks as if it's the BrokerClient (in a crash reporting child?) that fails to connect to the Broker (in the parent), gets an EIO and ends up confused?

Jed Davis [:jld] ⟨⏰|UTC-7⟩ ⟦he/him⟧

Assignee

Comment 29

•

7 years ago

So this is weird:

Sandbox: rewriting /proc/self/exe -> /proc/4525/exe
Sandbox: Recording mapping /home/henrik/code/gecko/firefox/firefox -> /proc/4525/exe
Sandbox: SandboxBroker: denied op=open rflags=0 perms=0 path= for pid=4525
Sandbox: Failed errno -13 op 0 flags 00 path 
Sandbox: SandboxBroker: denied op=open rflags=0 perms=0 path= for pid=4525
Sandbox: Failed errno -13 op 0 flags 00 path 

It's trying to pass open() an empty string.  Normally that's ENOENT, but we're returning EACCES.  If something is treating ENOENT specially that might be it.  We could try to find out what's making that call by crashing in the broker client if the path is empty.

Also, there aren't too many uses of /proc/self/exe in our code — the profiler is one, and a copy of libbacktrace used as part of the Rust crate backtrace-sys (media has been using Rust for a while) is another.  Or it could be a system library.

(In reply to Gian-Carlo Pascutto [:gcp] from comment #28)
> So it looks as if it's the BrokerClient (in a crash reporting child?) that
> fails to connect to the Broker (in the parent), gets an EIO and ends up
> confused?

We're not returning EIO, I don't think; we just return the error from sendmsg, which is probably EPIPE if the parent has exited.  I was suggesting that EIO is probably what a "real" filesystem would do if it couldn't access its underlying storage, and EPIPE is really not an expected response to open()ing a regular file; it probably doesn't make a difference, but maybe it does.

Gerry Chang [:gchang]

Comment 30

•

7 years ago

Mark 55- and 56- as we are only 1 week away from RC.

status-firefox55: affected → wontfix

status-firefox56: affected → wontfix

Jed Davis [:jld] ⟨⏰|UTC-7⟩ ⟦he/him⟧

Assignee

Comment 31

•

7 years ago

I can reproduce this, and I'm reasonably sure I know what's happening: the file broker socket isn't close-on-exec, so the crashreporter process inherits a copy of it, and the child process blocks in the broker client waiting for a response it will never get.  In particular, it blocks in the Gecko IPC I/O thread (trying to open /tmp/GeckoChildCrash%d.extra), and that's probably the thread that would normally notice that the parent process went away.

Assignee: gpascutto → jld

Jed Davis [:jld] ⟨⏰|UTC-7⟩ ⟦he/him⟧

Assignee

Comment 32

•

7 years ago

Reproducing this on a debug build requires some environment variables: MOZ_CRASHREPORTER=t (debug builds disable the normal crash reporter by default), as well as the misleadingly named XRE_NO_WINDOWS_CRASH_DIALOG=t — without the latter, even with this bug fixed, when the I/O thread returns from the broker call and crashes due to the IPC error caused by the parent process's nonexistence, it will do the usual debug-build “sleeping for 300 seconds” “Type … to attach your debugger” thing (probably because Breakpad tries to request a minidump from the nonexistent parent process and fails, in which case it delegates to the previous signal handler instead of immediately exiting), and the phantom audio will continue for a while.

Comment hidden (mozreview-request)

If these aren't close-on-exec, they can be inherited by the crash
reporter process after the parent process has crashed and exited,
causing child processes to continue running when the IPC I/O thread blocks
in the file broker trying to open a GeckoChildCrash temp file.
(Empirically, the main thread then blocks waiting for the I/O thread.)

Operations that run on dedicated threads, like playing media, may
continue even though the main and IPC threads are locked up, resulting in
videos that keep playing sound even though the browser seems to no longer
exist.

If the broker socket is closed as expected when the parent process
exits, the child will return failure from the brokered file operation
and then go on to get an IPC error due to the parent process's
nonexistence, and will exit as normal.

This patch makes the same change to rejected syscall reporting, even
though that's a one-way asynchronous message with no response to wait
for, just in case something goes wrong enough to fill the entire socket
buffer but not so badly broken that it would wind up in an infinite loop
anyway.

SOCK_CLOEXEC has been present since Linux 2.6.26, and it would be used
only if seccomp-bpf is available, so it should be safe to use
unconditionally.

Review commit: https://reviewboard.mozilla.org/r/179568/diff/#index_header
See other reviews: https://reviewboard.mozilla.org/r/179568/

Gian-Carlo Pascutto [:gcp]

Comment 34

•

7 years ago

mozreview-review

Comment on attachment 8907882 [details]
Bug 1363378 - Set close-on-exec in sandbox-related sockets held by parent process.

https://reviewboard.mozilla.org/r/179568/#review184848

Attachment #8907882 - Flags: review?(gpascutto) → review+

Pulsebot

Comment 35

•

7 years ago

Pushed by gpascutto@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/b549b2ed2efc
Set close-on-exec in sandbox-related sockets held by parent process. r=gcp

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Comment 36

•

7 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/b549b2ed2efc

Status: NEW → RESOLVED

Closed: 7 years ago

status-firefox57: affected → fixed

Resolution: --- → FIXED

Target Milestone: --- → mozilla57

Ryan VanderMeulen [:RyanVM]

Updated

•

7 years ago

status-firefox-esr52: ? → unaffected

Emil Ghitta, QA [:emilghitta]

Reporter

Comment 38

•

7 years ago

I have verified that this issue is no longer reproducible using the "Crash Me Now (Simple)"add-on, on Firefox Nightly 58.0a1 (BuildId:20171002220204) with Ubuntu 16.04 64bit.

This issues is also verified fixed on Firefox 57.0b4 (BuildId:20170928180207) with Ubuntu 16.04 64bit using the following snippet: 

Cu.import("resource://gre/modules/ctypes.jsm");
let zero = new ctypes.intptr_t(8);
let badptr = ctypes.cast(zero, ctypes.PointerType(ctypes.int32_t));
badptr.contents;

Status: RESOLVED → VERIFIED

status-firefox57: fixed → verified

status-firefox58: --- → verified

sandbox logging output 7 years ago Henrik Skupin [:whimboo][⌚️UTC+1] 175.04 KB, text/plain		Details
Bug 1363378 - Set close-on-exec in sandbox-related sockets held by parent process. 7 years ago Jed Davis [:jld] ⟨⏰\|UTC-7⟩ ⟦he/him⟧ 59 bytes, text/x-review-board-request	gcp : review+	Details