Closed Bug 1440203 Opened 6 years ago Closed 4 years ago

Use memfd_create for shared memory where available.

Categories

(Core :: IPC, enhancement, P2)

60 Branch
Unspecified
Linux
enhancement

Tracking

()

RESOLVED FIXED
84 Branch
Tracking Status
firefox60 --- wontfix
firefox84 --- fixed

People

(Reporter: jld, Assigned: jld)

References

Details

(Keywords: perf-alert)

Attachments

(2 files, 1 obsolete file)

memfd_create is a Linux-specific system call, added in kernel 3.17 (released 2014-10-05), which creates a pseudo-file backed by anonymous memory and not associated with a filesystem, accessible only through the returned file descriptor.

It's helpful for sandboxing, because it doesn't need any access to the actual filesystem.  We're currently proxying these accesses through the parent process, so avoiding that may improve performance.

More interestingly, it should also work around bug 1245239 and related bugs because these files aren't tied to any tmpfs instance's size limit; the memory is accounted for as if calling mmap with MAP_ANON (i.e., like malloc).

However, we'll still need to support older kernels that don't have it, so we'll need to fall back to the existing code.

One minor complication is that there isn't a C wrapper until glibc 2.27, and our build environment would be too old to even have the syscall numbers defined in its headers, so we'd need to hardcode them in shared_memory_posix.cc under ifdefs.  (This is safe; the kernel ABI will not change incompatibly.)
Priority: -- → P3
Assignee: nobody → jld
Depends on: 1439057
Priority: P3 → P2
See Also: 1439057
Comment on attachment 8969713 [details]
Bug 1440203 - Use memfd_create for shared memory on Linux where available.

https://reviewboard.mozilla.org/r/238508/#review244266
Attachment #8969713 - Flags: review?(nfroyd) → review+
Reminding myself to land this after bug 1439057 has had a few days to bake.
Flags: needinfo?(jld)
Bug 1439057 has, to continue the metaphor, exploded in the oven.
Flags: needinfo?(jld)
Blocks: 1472889
No longer blocks: 1472889
Blocks: 1464690
Something I'd like to do for this if possible: create tests to exercise the fallback path, including how it affects sandboxing.

Ideally we'd do smoke tests on each child process type with an env var set to fake no memfd_create support, but that might be difficult.  At a minimum I could write a gtest that forks a process and applies a “utility process” policy (cf. bug 1506291) to it, with some internal flag temporarily set.

One problem here is that, currently, this can cause different paths to be taken in the process actors (see, again, bug 1506291), because a policy like RDD will need brokering if and only if it can't use memfd_create.  That could be fixable with a little more abstraction.

(It's also worth mentioning that we have another piece of pre-3.17-only code that we're not testing: the signal broadcast gadget from bug 970676.  However, nobody has touched that code in years, because there's no reason to, so it's less likely to acquire regressions than the shared memory and sandbox broker/policy code.)

Note to self: modify this filtering, from bug 1510574, to include the memfd pseudo-paths.

Was looking at some of the Chrome's memfd_create code and noticed that there's code to detect memfd_create based on kernel version because of crashes on older versions:
https://source.chromium.org/chromium/chromium/src/+/master:third_party/perfetto/src/tracing/ipc/memfd.cc;l=51;drc=15fd919ebc4222f581aeaec71fc2c63092796ad1

That's not actually Chromium; it's part of an Android performance tool called Perfetto, maintained outside the Chromium repo and linked from it via this script. This particular code appears to have been copied from libartbase.

Note that neither the comments nor the commit message give any detail other than a link to Google's private internal bug tracker, so I don't know what to make of the extraordinary claim that “[s]ome older kernels segfault” given a syscall number that ought to be unassigned, nor how I could try to reproduce that result.

If I had to guess, I'd guess there's some Android device where the vendor added proprietary syscalls without considering compatibility… which would be a problem, because Gecko will probably be run on it as well. But we don't know what device it is, or if that's even what's truly going on.

(In reply to Jed Davis [:jld] ⟨⏰|UTC-6⟩ ⟦he/him⟧ from comment #8)

If I had to guess, I'd guess there's some Android device where the vendor added proprietary syscalls without considering compatibility… which would be a problem, because Gecko will probably be run on it as well. But we don't know what device it is, or if that's even what's truly going on.

I guess the best course of action is to not worry about it now. If we see crashes with older kernels than we worry about it then.

So here's a problem: if one thread forks while another is trying to freeze shared memory, the child process can inherit a writeable mapping; if the parent unmaps its own mapping and tries to seal before the child execs, the write seal will fail because of the writeable mapping.

Linux almost has a solution for this: MADV_DONTFORK, which opts out the mapping from being inherited in child processes. But that doesn't completely eliminate the problem: there's still a race window between the mmap and the madvise. It does move the entire race window into the shared memory code, so it should be possible to have that synchronize against fork (using pthread_atfork hooks, which sandboxed process launching doesn't support but that's fixable), but adding side effects to forking like that seems risky.

This isn't strictly a blocker — I'm also using procfs to give the child a read-only fd, and sandboxing will prevent it from using procfs to undo that, and this is necessary as a fallback for F_SEAL_FUTURE_WRITE which isn't supported by older kernels — but I'm also uncomfortable with ignoring errors or removing sealing support because of this.

Attachment #8969713 - Attachment is obsolete: true

The patch in bug 1582954 adds a posix_fallocate call to fallibly reserve space up-front and avoid the SIGBUS problem (which failed in bug 1245239 but apparently works now?); it's not clear to me if that's necessary in the memfd case, but it should be harmless.

See Also: → 1582954

memfd_create is an API added in Linux 3.17 (and adopted by FreeBSD
for the upcoming version 13) for creating anonymous shared memory
not connected to any filesystem. Supporting it means that sandboxed
child processes on Linux can create shared memory directly instead of
messaging a broker, which is unavoidably slower, and it should avoid
the problems we'd been seeing with overly small /dev/shm in container
environments (which were causing serious problems for using Firefox for
automated testing of frontend projects).

memfd_create also introduces the related operation of file seals:
irrevocably preventing types of modifications to a file. Unfortunately,
the most useful one, F_SEAL_WRITE, can't be relied on; see the large
comment in SharedMemory:ReadOnlyCopy for details. So we still use
the applicable seals as defense in depth, but read-only copies are
implemented on Linux by using procfs (and see the comments on the
ReadOnlyCopy function in shared_memory_posix.cc for the subtleties
there).

There's also a FreeBSD implementation, using cap_rights_limit for
read-only copies, if the build host is new enough to have the
memfd_create function.

Comment on attachment 9176355 [details]
Bug 1440203 - Support memfd_create in IPC shared memory.

It would be nice to have someone who knows FreeBSD look at the ifdef case I added for its version of memfd_create.

Attachment #9176355 - Flags: feedback?(jbeich)
Attachment #9176355 - Flags: feedback?(greg)

Comment on attachment 9176355 [details]
Bug 1440203 - Support memfd_create in IPC shared memory.

#elif defined(__FreeBSD__) && __FreeBSD_version >= 1300048

__FreeBSD_version is not a predefined macro, so requires either <sys/param.h> or <osreldate.h>. -Werror=undef can catch missing #include but not when a header is bootlegged via another. Given old -CURRENT snapshots are not supported by FreeBSD project maybe switch to:

#elif defined(__FreeBSD__) && __FreeBSD__ >= 13

However, looking at Linux case there's probably a more clean solution.

#  if !defined(__GLIBC__) || __GLIBC__ < 2 || \
      (__GLIBC__ == 2 && __GLIBC_MINOR__ < 27)

Why not AC_CHECK_FUNCS(memfd_create) instead of checking libc/OS version? (See example diff below)

Both musl and bionic (Android) seem to have memfd_create in <sys/mman.h> nowadays:
https://android.googlesource.com/platform/bionic.git/+/3d24d2b0883e%5E!/
https://git.musl-libc.org/cgit/musl/commit/?id=38f2fa3d0207

diff --git a/ipc/chromium/src/base/shared_memory_posix.cc b/ipc/chromium/src/base/shared_memory_posix.cc
index d6fd3358329de..464ad6b090f5d 100644
--- a/ipc/chromium/src/base/shared_memory_posix.cc
+++ b/ipc/chromium/src/base/shared_memory_posix.cc
@@ -83,8 +83,7 @@ SharedMemoryHandle SharedMemory::NULLHandle() { return SharedMemoryHandle(); }
 
 // Linux was the origin of memfd_create, added in kernel 3.17.
 
-#  if !defined(__GLIBC__) || __GLIBC__ < 2 || \
-      (__GLIBC__ == 2 && __GLIBC_MINOR__ < 27)
+#  if !defined(HAVE_MEMFD_CREATE)
 
 // glibc added a wrapper function in 2.27; if the build host is older
 // than that, we'll have to call the syscall directly (see also
@@ -99,8 +98,6 @@ static int memfd_create(const char* name, unsigned int flags) {
 
 #    endif  // have SYS_memfd_create
 
-#  else  // Linux, new glibc; has memfd_create function
-#    define HAVE_MEMFD_CREATE
 #  endif  // glibc version
 
 // To create a read-only duplicate of an fd, we can use procfs; the
@@ -112,14 +109,12 @@ static int DupReadOnly(int fd) {
   return HANDLE_EINTR(open(path.c_str(), O_RDONLY | O_CLOEXEC));
 }
 
-#elif defined(__FreeBSD__) && __FreeBSD_version >= 1300048
+#elif defined(__FreeBSD__) && defined(HAVE_MEMFD_CREATE)
 
 // FreeBSD added memfd_create (using the same infrastructure as its
 // SHM_ANON extension to shm_open) in 13.0, which is still in
 // development at the time of this writing.
 
-#  define HAVE_MEMFD_CREATE
-
 // FreeBSD's Capsicum framework allows irrevocably restricting the
 // operations permitted on a file descriptor.
 
diff --git a/old-configure.in b/old-configure.in
index e9aff5be49d1b..beb9e133ec04e 100644
--- a/old-configure.in
+++ b/old-configure.in
@@ -2791,6 +2791,7 @@ AC_SUBST(NSS_EXTRA_SYMBOLS_FILE)
 
 if test -n "$COMPILE_ENVIRONMENT"; then
 AC_CHECK_FUNCS(posix_fadvise posix_fallocate)
+AC_CHECK_FUNCS(memfd_create) # Linux, FreeBSD
 
 dnl Check for missing components
 if test "$MOZ_X11"; then

Comment on attachment 9176355 [details]
Bug 1440203 - Support memfd_create in IPC shared memory.

Does work well on FreeBSD, including with my capsicum sandboxing patch (bug 1607980). Agree with Jan on the ifdef details.

Attachment #9176355 - Flags: feedback?(greg) → feedback+
Pushed by jedavis@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/bb512f5fdeda
Prelude: ignore memfd for the "main thread I/O" test. r=Gijs
https://hg.mozilla.org/integration/autoland/rev/9366b15ee97c
Support memfd_create in IPC shared memory. r=glandium

Backed out for Valgrind bustages.

See https://bugzilla.mozilla.org/show_bug.cgi?id=1662564#c4

Looks like Valgrind understands memfd_create but not seals. That should be easy enough to fix.

Pushed by jedavis@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/468878422866
Prelude: ignore memfd for the "main thread I/O" test. r=Gijs
https://hg.mozilla.org/integration/autoland/rev/0b10bf76fe35
Support memfd_create in IPC shared memory. r=glandium
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → 83 Branch
Regressions: 1670277
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Target Milestone: 83 Branch → ---
Pushed by jedavis@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/ab11665d8607
Prelude: ignore memfd for the "main thread I/O" test. r=Gijs
https://hg.mozilla.org/integration/autoland/rev/6e44c037b2dc
Support memfd_create in IPC shared memory. r=glandium
Backout by dluca@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/b953dc378344
Backed out 2 changesets for Backout conflicts with Bug 1654103. CLOSED TREE
Pushed by jedavis@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/5d24d236fbb0
Prelude: ignore memfd for the "main thread I/O" test. r=Gijs
https://hg.mozilla.org/integration/autoland/rev/e13d2bfa8a79
Support memfd_create in IPC shared memory. r=glandium
Status: REOPENED → RESOLVED
Closed: 4 years ago4 years ago
Resolution: --- → FIXED
Target Milestone: --- → 84 Branch

== Change summary for alert #27364 (as of Mon, 26 Oct 2020 11:13:39 GMT) ==

Improvements:

Ratio Suite Test Platform Options Absolute values (old vs new)
3% glterrain linux64-shippable e10s stylo 9.25 -> 8.94
3% tsvg_static linux64-shippable e10s stylo 52.75 -> 51.12
3% tsvg_static linux64-shippable e10s stylo 52.76 -> 51.14
2% rasterflood_svg linux64-shippable e10s stylo 9,755.41 -> 9,516.01

For up to date results, see: https://treeherder.mozilla.org/perf.html#/alerts?id=27364

Keywords: perf-alert
Regressions: 1708576
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: