Closed Bug 1243108 Opened 8 years ago Closed 6 years ago

Intermittent failure to receive sandbox broker responses (SandboxBrokerTest.MultiThreadOpen, SandboxBrokerTest.MultiThreadStat, or “Sandbox: Unexpected EOF” by itself near non-shutdown crashes)

Categories

(Core :: Security: Process Sandboxing, defect, P1)

43 Branch
defect

Tracking


RESOLVED FIXED
mozilla63
Tracking Status
firefox48 --- wontfix
firefox49 --- wontfix
firefox50 --- wontfix
firefox51 --- wontfix
firefox63 --- fixed

People

(Reporter: KWierso, Assigned: jld)

References

(Blocks 1 open bug)

Details

(Keywords: intermittent-failure, Whiteboard: sb+)

Attachments

(2 files)

-5 means this:

> #define EIO              5      /* I/O error */

But that's not from the actual open(); this is also in the test log:

> Sandbox: Unexpected EOF, op 0 flags 00 path /dev/zero

That's from the broker client.  The parent is supposed to write a SandboxBrokerCommon::Response and attach a file descriptor; the child got EOF instead.  I need to look at this a little more closely.
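
For context, the handoff in question is ordinary AF_UNIX fd passing: the response payload and the opened file descriptor arrive together in one recvmsg() call, with the fd carried in an SCM_RIGHTS control message.  The following is only a minimal sketch of that receive-side pattern, with ExampleResponse and RecvResponse as made-up stand-ins rather than the real SandboxBrokerCommon API; a return of 0 from recvmsg() is exactly the "Unexpected EOF" case in the log above.

#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

struct ExampleResponse {       /* stand-in for the real response struct */
  int mError;                  /* 0 on success, otherwise an errno value */
};

/* Receive one response plus (optionally) one passed fd from the response
 * socket.  Returns recvmsg()'s return value; 0 means EOF. */
static ssize_t RecvResponse(int aRespFd, struct ExampleResponse* aResp,
                            int* aPassedFd) {
  char cmsgBuf[CMSG_SPACE(sizeof(int))];
  struct iovec iov = { aResp, sizeof(*aResp) };
  struct msghdr msg;
  memset(&msg, 0, sizeof(msg));
  msg.msg_iov = &iov;
  msg.msg_iovlen = 1;
  msg.msg_control = cmsgBuf;
  msg.msg_controllen = sizeof(cmsgBuf);

  ssize_t len = recvmsg(aRespFd, &msg, 0);
  *aPassedFd = -1;
  if (len <= 0) {
    return len;                /* 0 == EOF instead of a response */
  }
  struct cmsghdr* cmsg = CMSG_FIRSTHDR(&msg);
  if (cmsg && cmsg->cmsg_level == SOL_SOCKET &&
      cmsg->cmsg_type == SCM_RIGHTS) {
    memcpy(aPassedFd, CMSG_DATA(cmsg), sizeof(int));
  }
  return len;
}
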
Component: Security → Security: Process Sandboxing
To clarify: for the gtests this is in multiple threads in the same process; the “child” role is taken by the main thread and the “parent” is the broker thread.

I've heard about implementations of AF_UNIX sockets having bugs when a socket is closed and there's outstanding data not yet read by the peer, but I think that was specific to OS X.  Linux seems to handle that fine in the general case; I tried adding some nanosleep() calls to adjust the timing but they didn't reproduce this.

This code isn't being used for real on desktop yet, but it (or something equivalent) will be needed eventually; it is used on B2G, but B2G is now Tier 3.  So this test could be disabled if the failures become too common and there's no solution in sight, but that would invite bit rot.

This particular test invokes the broker client in a tight loop on multiple threads concurrently, to try to catch thread-unsafety in the client or broker bugs that would mix up different requests' responses somehow.  It's almost but not quite entirely implausible that there's an uncommon race condition in the Linux kernel this is being run on (still a heavily patched 3.2.x, I think?), which went undetected because normal workloads don't stress the SCM_RIGHTS code like that.
Summary: Intermittent SandboxBrokerTest.MultiThreadOpen | Expected: (zerofd) >= (0), actual: -5 vs 0 | test completed (time: 1198ms) → Intermittent SandboxBrokerTest.MultiThreadOpen or SandboxBrokerTest.MultiThreadStat | Expected: (zerofd) >= (0), actual: -5 vs 0 | test completed (time: 1198ms)
Blocks: sb-test
Whiteboard: sb+
Bulk assigning P3 to all open intermittent bugs without a priority set in Firefox components per bug 1298978.
Priority: -- → P3
(In reply to Jed Davis [:jld] {⏰UTC-7} from comment #2)
> This code isn't being used for real on desktop yet, but it (or something
> equivalent) will be needed eventually

“Eventually” is now: we've been using this code on Nightly since bug 1289718 landed.

It would really help if this bug could be reproduced under rr.  I tried running the tests locally under rr chaos mode, but no luck.
Blocks: 1317084
Summary: Intermittent SandboxBrokerTest.MultiThreadOpen or SandboxBrokerTest.MultiThreadStat | Expected: (zerofd) >= (0), actual: -5 vs 0 | test completed (time: 1198ms) → Intermittent SandboxBrokerTest.MultiThreadOpen or SandboxBrokerTest.MultiThreadStat | Expected: (zerofd) >= (0) or (nullfd) >= (0), actual: -5 vs 0 | test completed (time: 1198ms)
See Also: → 1421201
This seems to be turning up “for real” now that we're using the broker on desktop: see bug 1420475 and bug 1421201.  I added a log message in bug 1422198 for the case where sendmsg() fails, but it's not happening.  So the message is sent but never arrives.

I believe we're using much newer kernels for CI now (new enough to include unprivileged user namespaces and seccomp thread sync, at least), so the last part of comment #2 is probably not it.

I suppose it could be a kernel “feature”: maybe the sendmsg->close->recvmsg ordering isn't actually guaranteed not to drop the message, and in practice it only goes wrong in some odd edge case, but I don't know how much sense that makes.  (That *is* a problem with TCP, if the data is buffered locally but not transmitted yet — see http://blog.csdn.net/CPP_CHEN/article/details/29864509 — but there's no reason that would apply to AF_UNIX.  In any case, Chromium is doing the same thing we are, although they may not be using this kind of brokering as much as we do.)

In another bug I was speculating about a double-close or use-after-close bug somewhere, but I'm having trouble coming up with a plausible scenario where we wouldn't at least sometimes see actual errors, not just the message disappearing.

If the “missing” message was getting written to the wrong request's response socket, *after* the correct response was sent (these tests, if I recall correctly from when I wrote them years ago, try to check for concurrent requests getting their responses swapped), then that would show up like this.  But I don't have a plausible mechanism for how a bug would get us into that state without sometimes causing other failures instead.  This at least would be relatively easy to test: do a second recvmsg in the client and assert that it gets EOF.

I'm really just guessing at this point.
See Also: → 1420475
Whiteboard: sb+ → sb?
One thing I notice, in the 4 instances we still have logs for, are the loop counters:

Loop 10000/10000 @ /builds/worker/workspace/build/src/security/sandbox/linux/gtest/TestBroker.cpp:419
Loop 674/10000 @ /home/worker/workspace/build/src/security/sandbox/linux/gtest/TestBroker.cpp:408
Loop 9397/10000 @ /builds/worker/workspace/build/src/security/sandbox/linux/gtest/TestBroker.cpp:420
Loop 42/10000 @ /builds/worker/workspace/build/src/security/sandbox/linux/gtest/TestBroker.cpp:419

They're all near the beginning or end of the loop (so maybe when other threads are starting/exiting?), and one of them is the last of 10k iterations which is kind of suspect just by itself.


(Also, the canonical version of that article about TCP seems to be https://blog.netherlabs.nl/articles/2009/01/18/the-ultimate-so_linger-page-or-why-is-my-tcp-not-reliable )
Something that's worth trying: burn a few CPU-hours trying to reproduce this locally under rr.  If it replays, then I can try to see what's going on with the file descriptors.  Or, if the replay diverges, that would suggest a kernel bug.
Flags: needinfo?(jld)
I did run this a few dozen times in rr (with and without chaos mode), but it turns out to be essentially a worst case for recording overhead because it's almost all context switches and nontrivial syscalls; see https://github.com/mozilla/rr/issues/2159

And I haven't reproduced it yet, but it's possible that letting it run for longer might help….
I've run this about 1,000 times under rr now, and the bug hasn't happened.  I also tried running it 1,000 times locally *not* under rr, and… the bug also didn't reproduce.  So this may not be a useful approach.
Flags: needinfo?(jld)
Whiteboard: sb? → sb+
See Also: → 1442989
See Also: → 1439140
Unfortunately, this doesn't seem to have been a double-close caused by bug 1278361, because we're still seeing it.  (This isn't a big surprise — see earlier comments that the observables here don't make sense for that kind of bug — but it sure would have been convenient if it had mysteriously been fixed.  Sigh.)

Also, bug 1456022 seems to be another manifestation of this.
See Also: → 1456022
I can reproduce this locally, given an hour or two: I increased the loop count in the tests by 10x (to amortize startup/shutdown/scripting overhead) and ran mach gtest in a shell loop, with an opt build, and got 1 hit in ~400 runs; a debug build also worked but took a lot longer (1 in ~1800).  Note that this roughly corresponds to 4,000 opt runs of the normal test.

So, things that can be done:

* Send one or more fake responses with a fictitious error code (or otherwise marked) after the real response, so the client can detect the case where the response was lost.  This assumes that whatever is happening is a more-or-less independent chance per message, not some property of the response socket itself.

* Send the real response multiple times, to work around lost responses if that's what's going on.  (I hope we won't have to do that.)

* If I can be sufficiently sure that an unexpected EOF means the request was never received (see above), add a retry loop in the client.  (Retrying operations that may or may not have already happened is… delicate, and I don't want to go there if I don't have to; see also NFS.)

* Add a variant of the multithread test where the clients alternate opening a per-thread temp file with O_CREAT | O_EXCL and unlinking it — either a dropped request or a duplicated one will cause an error, as sketched below.  (mkdir + rmdir would also work.)
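
A rough sketch of that last variant, under the assumption of hypothetical broker_open()/broker_unlink() stand-ins for the broker client calls (the real TestBroker.cpp code differs):

#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

extern int broker_open(const char* path, int flags, mode_t mode);  /* stand-in */
extern int broker_unlink(const char* path);                        /* stand-in */

/* Each thread runs this loop on its own private path.  A lost or
 * duplicated operation surfaces as a hard error on one of these calls
 * (unexpected EOF, EEXIST on the open, or a failed unlink), rather than
 * silently as a missing response. */
static void CreateUnlinkLoop(const char* path, int iterations) {
  for (int i = 0; i < iterations; ++i) {
    int fd = broker_open(path, O_WRONLY | O_CREAT | O_EXCL, 0600);
    assert(fd >= 0);                   /* EEXIST => an earlier op was lost or duplicated */
    assert(close(fd) == 0);
    assert(broker_unlink(path) == 0);  /* error => this request or its response was lost */
  }
}
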
(In reply to Jed Davis [:jld] (⏰UTC-6) from comment #40)
> * Send one or more fake responses with a fictitious error code (or otherwise
> marked) after the real response, so the client can detect the case where the
> response was lost.

I did this and ran it overnight: 4 hits in ~1600 runs, all of them apparently a lost request rather than a lost response.

> * Add a variant of the multithread test where the clients alternate opening
> a per-thread temp file with O_CREAT | O_EXCL and unlinking it — either a
> dropped request or a duplicated one will cause an error.  (mkdir + rmdir
> would also work.)

I did this (both creat+unlink and mkdir+rmdir) and ran it all afternoon and evening, and I got one error after 1388 runs (~2.78e9 broker calls) — a lost rmdir that succeeded when reissued.  Changing the operations seems to affect the error rate; this may be significant.

But that test closed the file before the unlink.  I modified it to close the file last (so that it's the broker client thread and not the server that causes the inode to be freed), and it's failed twice in only 160 iterations (80e6 calls): one a lost unlink that succeeded on retry, and one a lost open that did *not* — it failed with EEXIST.  And that's *with* the change to add a second response.  So now this is making even less sense than it did before.


Another possibility here is writing a self-contained C program that does the same socket operations as the broker test and run *that* for a few hours, to try to rule out everything else in Gecko and maybe point toward a kernel bug.
(In reply to Jed Davis [:jld] (⏰UTC-6) from comment #41)
> So now this is making even less sense than it did before.

It turns out I messed up the retry logic a little.  The open was retried, but the unlink (and rmdir from the earlier test) weren't.  So this means the requests *are* being received, but the responses are lost, even when multiple responses are sent.  This suggests that something is happening to the socket that's passed to the parent (but not to the non-socket fds passed back for the open test).

The creat+unlink+close test continues to be much more effective than the creat+close+unlink or mkdir+rmdir variants: 11 failures in 3041 runs (one per ~2.76e8 calls, or ~10x the rate).  4 were lost opens whose retry failed with EEXIST, and 7 were lost unlinks that had nonetheless observably taken effect.
New findings:

* Extra assertions (for short reads/writes, close() failing) don't fire, but do seem to reduce the repro rate, even though it's a handful of arithmetic ops and a predictable branch or two in a path that includes a nontrivial syscall, so it shouldn't be throwing off timing all that much.  Needs more data.

* Setting the response socket nonblocking and spamming extra responses until there's an error doesn't help; none of them are received.  This probably wasn't too surprising given previous results.

* The response socket's inode number in the server after it sends the reply is the same as what the client sees immediately after the socketpair() call.  (I modified the broker requests to include the inode number.)  I haven't yet tested the client end of the socket the same way; I could, but I'm doubtful.

* It doesn't appear to be a soundness bug in the Linux kernel's socket GC.  (Because sockets can be held in queued SCM_RIGHTS messages in sockets, there can be reference cycles; the kernel has a cycle collector that breaks cycles by, if I understand correctly, removing and destroying the socket buffer.)  This was tested by using ftrace to log calls to kfree_skb; a call from unix_gc can be seen when running a test program that creates a one-socket cycle, but not when the gtests fail.  (kfree_skb doesn't appear to be called for the normal socket destruction caused by the broker, so the log is low-traffic enough to read manually — about 19 calls per test run — and doesn't appear to have lost messages due to buffer overrun).

* I also tried doing a second recvmsg on both sides and asserting that they got EOF, to try to detect messages possibly being sent to the wrong socket somehow.  To make that not deadlock, I also added shutdown(fd, SHUT_WR) calls.  With that build, the bug hasn't reproduced once in 6.0e9 broker calls.

If that last result holds up, and if I can troubleshoot it down to a smaller part of the change… I don't *like* making changes without understanding why, but I could.
I wrote a small C program to try to reproduce this without the overhead of Gecko.  It works *very* well — it takes about 60 runs (60 million RPCs) per failure on average, but with a lot of variance (which is expected for an exponential distribution).  This is partially due to giving each client a separate dup() of the client fd; without that, the first failure took 448 runs.  That may be due to lock contention on the fd vs. just on the underlying socket changing the timing.

It doesn't seem to reproduce at all (not observed in 3000 runs) if confined to a single core.  It also didn't reproduce in 136 runs under rr chaos mode (where it's 60x slower).

Adding a SHUT_WR on the response socket after sending and before closing reduces but doesn't eliminate it: the first failure was after 300 runs.

So I added a sched_yield in that spot, and it lasted 1171 runs.  This suggests experimenting with ensuring that the last close on the sending end of the response socket happens-after the receive — sending a dup of it back to the client in a second message, or doing a blocking read on it (expecting EOF) before closing.
Assignee: nobody → jld
I tested with a change to make the server wait for EOF from the client, and that seems to avoid the bug (no failures in 4 billion RPCs).  It also doubles the wall time, although it might not affect latency in practice if we're not getting multithreaded back-to-back requests.

The approach of sending the socket back to the client (in a message that will either be dropped on client close, or fail to be sent if the client closes first) also holds up: it ran for 14.4 billion RPCs without failing.  That change actually *decreases* wall time slightly for some reason, although total CPU time increased somewhat.
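
For reference, the general shape of that second approach, as a sketch under stated assumptions rather than the actual patch (SendResponseWithKeepalive is a made-up helper name, and the real broker attaches the opened fd to the first message as well):

#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* After writing the real response, send one extra message that carries our
 * own end of the response socket back across that same socket via
 * SCM_RIGHTS.  The queued in-flight fd keeps the sending side's file
 * referenced, so our close() no longer races the client's recvmsg() into
 * EOF; if the client has already closed, the sendmsg simply fails and the
 * kernel discards the reference. */
static void SendResponseWithKeepalive(int aRespFd, const void* aResp,
                                      size_t aLen) {
  /* 1. The real response payload. */
  send(aRespFd, aResp, aLen, MSG_NOSIGNAL);

  /* 2. The keep-alive: pass aRespFd back over itself. */
  char dummy = 0;
  char cmsgBuf[CMSG_SPACE(sizeof(int))];
  struct iovec iov = { &dummy, 1 };
  struct msghdr msg;
  memset(&msg, 0, sizeof(msg));
  msg.msg_iov = &iov;
  msg.msg_iovlen = 1;
  msg.msg_control = cmsgBuf;
  msg.msg_controllen = sizeof(cmsgBuf);
  struct cmsghdr* cmsg = CMSG_FIRSTHDR(&msg);
  cmsg->cmsg_level = SOL_SOCKET;
  cmsg->cmsg_type = SCM_RIGHTS;
  cmsg->cmsg_len = CMSG_LEN(sizeof(int));
  memcpy(CMSG_DATA(cmsg), &aRespFd, sizeof(int));
  (void)sendmsg(aRespFd, &msg, MSG_NOSIGNAL);  /* may fail if the client already closed */

  /* 3. Now the local close can't hide the already-queued response. */
  close(aRespFd);
}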

For comparison, I also tried running the unmodified test case on FreeBSD; it ran for 13.8 billion RPCs with no failures.
As for *why* this is happening, I think I finally have an explanation:

1. (server) recvmsg calls __skb_try_recv_datagram, sees no pending data, and prepares to sleep
2. (client) sendmsg calls skb_queue_tail to append the packet to the peer's receive queue
3. (client) close → fput → unix_release_sock sets the peer's sk_shutdown field to the equivalent of SHUT_RDWR
4. (server) recvmsg calls __skb_wait_for_more_packets, which reads sk_shutdown and indicates EOF

Corresponding source lines:

[1] https://github.com/torvalds/linux/blob/06dd3dfeea60e2a6457a6aedf97afc8e6d2ba497/net/unix/af_unix.c#L2098
[2] https://github.com/torvalds/linux/blob/06dd3dfeea60e2a6457a6aedf97afc8e6d2ba497/net/unix/af_unix.c#L1797
[3] https://github.com/torvalds/linux/blob/06dd3dfeea60e2a6457a6aedf97afc8e6d2ba497/net/unix/af_unix.c#L540
[4] https://github.com/torvalds/linux/blob/06dd3dfeea60e2a6457a6aedf97afc8e6d2ba497/net/core/datagram.c#L102
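
For illustration, the same interleaving expressed as a stripped-down two-thread program (a hypothetical sketch, not the actual reproducer from earlier comments; since the window between steps 1 and 4 is tiny, this can run for a very long time before a message is lost, and only on a multi-core machine; build with -pthread):

#include <assert.h>
#include <pthread.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

/* The "client" side of the race: queue one datagram, then close. */
static void* sender(void* arg) {
  int fd = *(int*)arg;
  char byte = 'x';
  send(fd, &byte, 1, MSG_NOSIGNAL);  /* step 2: append to the peer's receive queue */
  close(fd);                         /* step 3: sets the peer's sk_shutdown */
  return NULL;
}

int main(void) {
  for (;;) {
    int fds[2];
    assert(socketpair(AF_UNIX, SOCK_SEQPACKET, 0, fds) == 0);

    pthread_t t;
    assert(pthread_create(&t, NULL, sender, &fds[1]) == 0);

    /* Steps 1 and 4: block waiting for the datagram.  In the bad
     * interleaving, the wait sees sk_shutdown and reports EOF even though
     * the datagram is already sitting in the queue. */
    char byte;
    ssize_t len = recv(fds[0], &byte, 1, 0);
    if (len == 0) {
      fprintf(stderr, "lost the message: got EOF instead of data\n");
      return 1;
    }
    assert(len == 1);

    pthread_join(t, NULL);
    close(fds[0]);
  }
}
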
See Also: → 1459553
Priority: P3 → P1
Summary: Intermittent SandboxBrokerTest.MultiThreadOpen or SandboxBrokerTest.MultiThreadStat | Expected: (zerofd) >= (0) or (nullfd) >= (0), actual: -5 vs 0 | test completed (time: 1198ms) → Intermittent failure to receive sandbox broker responses (SandboxBrokerTest.MultiThreadOpen, SandboxBrokerTest.MultiThreadStat, or “Sandbox: Unexpected EOF” by itself near non-shutdown crashes)
See Also: 1421201
See Also: 1420475
See Also: 1442989
See Also: 1439140
See Also: 1459553
No longer blocks: 1317084
Back when I posted comment #48, I was planning to report the bug upstream first and seek input on what workaround we should use, but then other things intervened.  Instead, what I'm going to do now is land the workaround technique that's been reasonably well verified with the minimal test case, so that things stop breaking, and if later it's decided that there's a better way then we can always change it.
Comment on attachment 8993568 [details]
Bug 1243108 - Work around Linux kernel bug causing broker responses to be dropped.

https://reviewboard.mozilla.org/r/258280/#review265980
Attachment #8993568 - Flags: review?(gpascutto) → review+
Pushed by jedavis@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/ecae69f91831
Work around Linux kernel bug causing broker responses to be dropped. r=gcp
https://hg.mozilla.org/mozilla-central/rev/ecae69f91831
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla63
New occurrence:

https://treeherder.mozilla.org/logviewer.html#?job_id=193734658&repo=mozilla-beta
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
The patch wasn't uplifted to Beta, so it's expected to still see failures there.  It's 3 weeks before release, and this has been broken since 54, so it probably wouldn't be accepted for uplift.
Status: REOPENED → RESOLVED
Closed: 6 years ago → 6 years ago
Resolution: --- → FIXED
Recent failure log: https://treeherder.mozilla.org/logviewer.html#?job_id=195441939&repo=mozilla-central&lineNumber=9479
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
(In reply to Eliza Balazs [:ebalazs_] from comment #70)
> Recent failure log: https://treeherder.mozilla.org/logviewer.html#?job_id=195441939&repo=mozilla-central&lineNumber=9479

This failure has absolutely nothing to do with this bug. It's just bug 1482029.
Status: REOPENED → RESOLVED
Closed: 6 years ago → 6 years ago
Resolution: --- → FIXED