Closed Bug 1259512 (e10s-oom) Opened 8 years ago Closed 8 years ago

[e10s] significantly higher rates of OOM crashes in the content process of Firefox with e10s than in the main process of non-e10s

Categories

(Core :: DOM: Content Processes, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

Status: RESOLVED FIXED
Target Milestone: mozilla48
Tracking Status: e10s + ---

People

(Reporter: benjamin, Assigned: mccr8)

References

(Depends on 1 open bug)

Details

(Whiteboard: [MemShrink:meta] btpp-active)

Attachments

(1 file)

Based on the recent e10s experiments on beta, there is a much higher incidence of OOM crashes with e10s enabled than with e10s disabled. It's silly to file a separate bug for each OOM crash signature, so instead I'm filing this single bug.

Our stability problems are primarily in the content process, not in the chrome process.

Analysis of e10s crash rates:
https://github.com/vitillo/e10s_analyses/blob/master/beta46-noapz/e10s-stability-analysis.ipynb

Breakdown of signatures that affect e10s:
https://gist.github.com/bsmedberg/f23e84ae4021a1cc3bcf

There are many things I don't know yet about this problem:
* Platforms affected
* Are we running out of real memory (physical+swap) or VM? Or just fragmenting our address space to death?
* Does this primarily affect certain subgroups, such as people with acceleration on/off or certain graphics drivers?

Running out of real memory could hopefully be figured out with about:memory.

Running out of VM might be caused by a bug in mapping shared memory sections across the IPC boundary, or other bugs (graphics drivers have caused problems like this in the past).

Related bugs:
bug 1250672 OSX does not reclaim memory properly
bug 1257486 - add more memory annotations to content process crash reports
bug 1236108 and followup bug 1256541 - make OOMAllocationSize work in content process crash reports
bug 1259358 - nsITimer sometimes doesn't work with e10s (could be causing GC/CC scheduling problems?)

Please DO dup any small-OOM crash signature bugs which are specific to e10s here.
Please DO NOT dup large-OOM crash signatures to this bug.
Alias: e10s-oom
Whiteboard: [MemShrink]
Thanks for filing this. I noticed myself yesterday that something might be awry here. For instance, in the control group there are 124 crashes in mozilla::CycleCollectedJSRuntime::CycleCollectedJSRuntime, but with e10s there are 192, which seems like an alarming difference.

The only bad e10s-specific leak (rather than just bloat) I'm aware of is bug 1252677, which looks like some kind of Windows-specific SharedMemory/PTextureChild leak. However, Bas said the leaking tests are related to "drawing video to a Canvas", which doesn't sound like something you'd see very often in regular web browsing.
Yeah, it's worse than 124/192, because content process crashes have a 10% submission rate while chrome process crashes have a 50% submission rate.
Depends on: 1256541
Depends on: 1257486
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #4)
> Yeah, it's worse than 124/192, because content process crashes have a 10%
> submission rate while chrome process crashes have a 50% submission rate.
Without knowing how many crashes go unreported at shutdown, it is hard to gauge whether there is really such a big change. To me the reported signatures seem about the same (though that is just from ad hoc viewing; startup crashes get reported more often).

js::AutoEnterOOMUnsafeRegion::crash seems to be the big increase. (The data is messy due to a signature change.)
mozilla::CycleCollectedJSRuntime::CycleCollectedJSRuntime is at only 46 (with any luck the fix is bug 1247122).

Beta 45 vs. beta 46 e10s comparison:
https://crash-analysis.mozilla.com/rkaiser/datil/searchcompare/?common=process_type%3Dbrowser%26process_type%3Dcontent%26ActiveExperimentBranch%3D%253Dexperiment-no-addons&p1=ActiveExperiment%3D%253De10s-beta45-withoutaddons%2540experiments.mozilla.org%26date%3D%3E%253D2016-02-12%26date%3D%3C2016-02-25&p2=ActiveExperiment%3D%253De10s-beta46-noapz%2540experiments.mozilla.org%26date%3D%3E%253D2016-03-11%26date%3D%3C2016-03-25
Depends on: 1259187
Depends on: 1259183
Depends on: 1235633
Depends on: 1257387
tracking-e10s: --- → +
Priority: -- → P1
Could IPC be making heap fragmentation more serious? If so, things like 1) what :billm is doing in bug 1235633 comment 12 to avoid a copy, or 2) reusing message buffers might help.
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #0)
> Based on the recent e10s experiments on beta, there is a much higher
> incidence of OOM crashes with e10s enabled than with e10s disabled.

Is this true? With e10s disabled, "OOM | small" accounts for about 7% of all crashes in beta 46. With e10s enabled, it's 0.55%. Granted, there are some specific OOM signatures in e10s that are significant. But it still seems like e10s has fewer OOMs overall. Am I misreading the data? I know that the way the crash annotations work is really complex.
(In reply to Bill McCloskey (:billm) from comment #9)
> Am I misreading the data?

Bug 1256541 only landed on beta yesterday, so we do not have crash signatures with proper OOM annotations for content processes from beta yet. To get an approximation of what OOM | small would be, you have to add up the various OOM | unknown signatures, like those in the two duped bugs.
(In reply to Ting-Yu Chou [:ting] from comment #8)
> Could IPC be making heap fragmentation more serious?

Yes, that is the best guess I have so far. However, telemetry suggests that VSIZE_MAX_CONTIGUOUS is maybe a little better with e10s enabled, which contradicts that theory[1]. VSIZE_MAX_CONTIGUOUS is the largest contiguous amount of address space in the process, and tends to get really low when there is severe heap fragmentation. It does seem like it is always fairly low for some people, around 800kb in the chart, so maybe many users are just in a precarious state that results in a crash when IPC ends up allocating a large block of memory.

[1] https://github.com/vitillo/e10s_analyses/blob/master/beta45-withoutaddons/e10s_experiment.ipynb
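
For illustration, here is a minimal sketch of what measuring the largest contiguous free block on Windows involves: walking the address space with VirtualQuery and taking the biggest MEM_FREE region. This is just a standalone example of the concept, not the actual telemetry probe behind VSIZE_MAX_CONTIGUOUS.

  #include <windows.h>
  #include <cstdio>

  // Walk the whole address space and report the largest free region.
  // A single allocation bigger than this value fails even if plenty of
  // scattered free pages remain, which is what "fragmenting to death"
  // looks like.
  int main() {
    SIZE_T largestFree = 0;
    MEMORY_BASIC_INFORMATION info;
    char* addr = nullptr;
    while (VirtualQuery(addr, &info, sizeof(info)) == sizeof(info)) {
      if (info.State == MEM_FREE && info.RegionSize > largestFree) {
        largestFree = info.RegionSize;
      }
      addr = static_cast<char*>(info.BaseAddress) + info.RegionSize;
    }
    printf("largest contiguous free block: %zu bytes\n", largestFree);
    return 0;
  }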

I think we can analyze crash minidumps to figure out what exactly the address space looks like. There was some previous work along these lines for desktop Firefox in bug 1001760, and hopefully we can reuse those analyses.

> If so, things like 1) what :billm is doing in bug 1235633 comment 12 to avoid
> a copy, or 2) reusing message buffers might help.

Yes, I think those are worth trying. Also, the Pickle::Resize() method currently uses a doubling strategy, which could be bad if the size is nearing that of our largest contiguous block. There's a bug on file for it but I don't remember which.
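
To make the concern concrete, here is a hypothetical sketch of the growth strategies (NextCapacityDoubling and NextCapacityCapped are made-up helpers, and the 8 MB threshold is an assumed value; this is not the real Pickle code or the fix proposed elsewhere): with pure doubling, the single allocation request can overshoot the needed size by almost 2x, which matters when the largest free block is barely big enough for the data itself.

  #include <cstddef>

  // Pure doubling: e.g. a 33 MB message forces a 64 MB request.
  size_t NextCapacityDoubling(size_t current, size_t needed) {
    size_t cap = current ? current : 64;
    while (cap < needed) {
      cap *= 2;
    }
    return cap;
  }

  // One possible alternative: double up to a threshold, then grow in
  // fixed increments so the request stays close to the actual size.
  size_t NextCapacityCapped(size_t current, size_t needed) {
    const size_t kThreshold = 8 * 1024 * 1024;  // assumed value
    size_t cap = current ? current : 64;
    while (cap < needed) {
      cap = (cap < kThreshold) ? cap * 2 : cap + kThreshold;
    }
    return cap;
  }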
> Also, the Pickle::Resize() method
> currently uses a doubling strategy, which could be bad if the size is
> nearing that of our largest contiguous block. There's a bug on file for it
> but I don't remember which.

It's bug 1253131. I said I would take a look but I haven't got around to it yet. I'll probably get to it, maybe early next week, though I'd be happy if someone else took a look in the meantime.
I was wondering whether there is any information jemalloc can show us, but then realized that features like profiling [1] and measuring external fragmentation (stats.active) [2] do not exist in mozjemalloc.

The best I can get is:

  ___ Begin malloc statistics ___
  Assertions disabled
  Boolean MALLOC_OPTIONS: aCjPz
  Max arenas: 1
  Pointer size: 8
  Quantum size: 16
  Max small size: 512
  Max dirty pages per arena: 256
  Chunk size: 1048576 (2^20)
  Allocated: 47656384, mapped: 110100480
  huge: nmalloc      ndalloc    allocated
             27           26      2097152

  arenas[0]:
  dirty: 136 pages dirty, 204 sweeps, 4659 madvises, 38082 pages purged
              allocated      nmalloc      ndalloc
  small:       31968704      1411295      1132137
  large:       13590528        33356        31853
  total:       45559232      1444651      1163990
  mapped:     106954752
  bins:     bin   size regs pgs  requests   newruns    reruns maxruns curruns
              0 T    8  500   1     76882        43       422      21      21
              1 Q   16  252   1    222439       543      4460     276     143
              2 Q   32  126   1    320891      1608     10905     985     766
              3 Q   48   84   1    145888      1020      7771     652     568

Some notes:

  - it seems no one is actively working on bug 762449
  - [3] mentions "virtual memory fragmentation", which sounds like the VSIZE_MAX_CONTIGUOUS metric :mccr8 mentioned in comment 11

[1] https://github.com/jemalloc/jemalloc/wiki/Use-Case:-Heap-Profiling
[2] http://blog.gmane.org/gmane.comp.lib.jemalloc/month=20140501
[3] http://www.canonware.com/pipermail/jemalloc-discuss/2013-April/000572.html
(In reply to Andrew McCreight [:mccr8] from comment #11)
> I think we can analyze crash minidumps to figure out what exactly the
> address space looks like. There was some previous work along these lines for
> desktop Firefox in bug 1001760, and hopefully we can reuse those analyses.

I tried the minidump-memorylist tool with the latest Google Breakpad, and it crashed. Somehow GetMemoryInfoList() [1] returns null. :(

[1] https://github.com/bsmedberg/minidump-memorylist/blob/master/minidump-memorylist.cc#L66
> profiling [1]

You can use DMD's "live" mode to do generic heap profiling. See the docs at https://developer.mozilla.org/en-US/docs/Mozilla/Performance/DMD
Depends on: 1253131
Depends on: 1238707
(In reply to Ting-Yu Chou [:ting] from comment #14)
> (In reply to Andrew McCreight [:mccr8] from comment #11)
> > I think we can analyze crash minidumps to figure out what exactly the
> > address space looks like. There was some previous work along these lines for
> > desktop Firefox in bug 1001760, and hopefully we can reuse those analyses.
> 
> I tried the tool minidump-memorylist with the latest google breakpad, and it
> crashed. Somehow GetMemoryInfoList() [1] returns null. :(
> 
> [1]
> https://github.com/bsmedberg/minidump-memorylist/blob/master/minidump-
> memorylist.cc#L66

This only works on dumps produced on Windows, FYI. We could make it work for Linux as well (Linux minidumps include /proc/self/maps). Mac minidumps do not include memory mapping info, IIRC.
Assignee: nobody → continuation
Whiteboard: [MemShrink] → [MemShrink:meta]
Depends on: 1262661
Depends on: 1262918
No longer depends on: 1262661
No longer depends on: 1253131
Whiteboard: [MemShrink:meta] → [MemShrink:meta] btpp-active
Depends on: 1263774
Depends on: 1259480
Depends on: 1263916
Depends on: 1229384
No longer depends on: 1259480
:mccr8, is there anything I can help with? I am not sure which bugs you are not working on and which have higher priority.
Flags: needinfo?(continuation)
(In reply to Ting-Yu Chou [:ting] from comment #17)
> :mccr8, is there anything I can help with? I am not sure which bugs you are
> not working on and which have higher priority.

I'm only working on bug 1253131 right now (bug 1263235 is just waiting for a review). I'm not really sure what the priority should be for the various bugs. We still don't have a great idea of what is causing OOMs, aside from large messages likely causing problems with contiguous address space. Bug 1262671 would be good to have, but it may take a little while, so something shorter term might be better right now. I'm not sure.
Flags: needinfo?(continuation)
(In reply to Andrew McCreight [:mccr8] from comment #18)
> large messages likely causing problems with contiguous address space. Bug
> 1262671 would be good to have, but it may take a little while, so something
> shorter term might be better right now. I'm not sure.

Then I'll see if I can get some data that :dmajor did in bug 1001760.
minidump-memorylist crashed because GetMemoryInfoList() returns null. The message:

  2016-04-13 16:58:58: minidump.cc:4765: INFO: GetStream: type 16 not present

means it can not find the list of information about mapped memory regions for a process from the dump file.
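
For reference, a sketch of the kind of guard the tool needs, using the google_breakpad Minidump API that the linked minidump-memorylist.cc relies on (DumpMemoryInfo is a hypothetical wrapper, and the exact accessors should be checked against the Breakpad headers): bail out when the MINIDUMP_MEMORY_INFO_LIST stream (type 16) is absent instead of dereferencing a null list.

  #include <cstdio>
  #include "google_breakpad/processor/minidump.h"

  int DumpMemoryInfo(const char* path) {
    google_breakpad::Minidump dump(path);
    if (!dump.Read()) {
      fprintf(stderr, "failed to read %s\n", path);
      return 1;
    }
    google_breakpad::MinidumpMemoryInfoList* list = dump.GetMemoryInfoList();
    if (!list) {
      // Non-Windows dumps (and dumps missing the stream) end up here.
      fprintf(stderr, "%s has no memory info list\n", path);
      return 1;
    }
    for (unsigned int i = 0; i < list->info_count(); ++i) {
      const google_breakpad::MinidumpMemoryInfo* region =
          list->GetMemoryInfoAtIndex(i);
      printf("base 0x%llx size 0x%llx\n",
             static_cast<unsigned long long>(region->GetBase()),
             static_cast<unsigned long long>(region->GetSize()));
    }
    return 0;
  }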
(In reply to Ting-Yu Chou [:ting] from comment #20)
> minidump-memorylist crashed because GetMemoryInfoList() returns null. The
> message:
> 
>   2016-04-13 16:58:58: minidump.cc:4765: INFO: GetStream: type 16 not present
> 
> means it can not find the list of information about mapped memory regions
> for a process from the dump file.

Was this dump from a Windows system? If not, see comment 16.
Great, I was looking into what MiniDumpNormal dumps. Would you fix it?
I'll fix this, it should be a straightforward patch.
No, that's for "memory reports", which are the content of about:memory. The "memory info stream" is bug 1264242. (These are confusingly similar, aren't they?)
Depends on: 1110596
Depends on: 1263028
Depends on: 1265015
Depends on: 1265902
Depends on: 1264161
Depends on: 1266517
Depends on: 1267329
Depends on: 1269365
Attached file oomdata.txt
Analysed the minidumps from 47.0b1 that have the crash signature [@ OOM | small] in the content process. See bug 1001760 comment 4 for the meaning of each field.

Basically it suggests bug 1005844 would be helpful.
jimm noticed that a lot of the OOM small crashes are happening in IPC code: http://tinyurl.com/hxkcs79
(If you add "proto signature" as a facet to your super search then it shows the actual signatures.)

It might be that this is because memory usage is higher while we are in the middle of dealing with IPC, due to all of the allocations needed to serialize and deserialize. If we could reduce that memory spike, we might be able to reduce the number of OOM crashes.
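
To make the suspected spike concrete, a tiny hypothetical example (the Message and Serialize names are made up for illustration; this is not the real IPDL or Pickle code): while a large message is being built, the source data and the serialized copy are both live, so peak usage is roughly double the payload size until the send completes.

  #include <string>
  #include <vector>

  // Hypothetical message type: just a flat buffer of serialized bytes.
  struct Message {
    std::vector<char> payload;
  };

  Message Serialize(const std::string& bigData) {
    Message msg;
    // Both bigData and msg.payload are alive here, so roughly
    // 2 * bigData.size() bytes are committed at the peak.
    msg.payload.assign(bigData.begin(), bigData.end());
    return msg;
  }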
Depends on: 1272018
(In reply to Ting-Yu Chou [:ting] from comment #27)
> Created attachment 8749567 [details]
> oomdata.txt

I was wondering why there are so many tiny (<1M) blocks. Using minidump-memorylist, I saw that many regions have memory protection PAGE_WRITECOMBINE and type MD_MEMORY_TYPE_MAPPED; I am not sure what they are for:

BaseAddress  AllocationBase  AllocationProtect  RegionSize  State  Protect  Type
7fd60000     7fd60000        404                8000        1000   404      40000
7fd68000     0               0                  8000        10000  1        0
7fd70000     7fd70000        404                8000        1000   404      40000
7fd78000     0               0                  8000        10000  1        0
7fd80000     7fd80000        404                1c000       1000   404      40000
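
For anyone decoding the raw hex values: they are the standard VirtualQuery flags from <winnt.h>, e.g. Protect 404 is PAGE_READWRITE | PAGE_WRITECOMBINE, State 1000 is MEM_COMMIT, 10000 is MEM_FREE, and Type 40000 is MEM_MAPPED. A small illustrative helper (DescribeRegion is not part of minidump-memorylist):

  #include <windows.h>
  #include <cstdio>

  // Decode the Protect/State/Type columns above using <winnt.h> constants.
  void DescribeRegion(DWORD protect, DWORD state, DWORD type) {
    printf("write-combine: %s, ", (protect & PAGE_WRITECOMBINE) ? "yes" : "no");
    printf("state: %s, ", state == MEM_COMMIT ? "committed"
                        : state == MEM_FREE ? "free"
                        : state == MEM_RESERVE ? "reserved" : "?");
    printf("type: %s\n", type == MEM_MAPPED ? "mapped"
                       : type == MEM_PRIVATE ? "private"
                       : type == MEM_IMAGE ? "image" : "?");
  }

  // Example: the first row decodes as write-combine, committed, mapped.
  // DescribeRegion(0x404, 0x1000, 0x40000);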
Depends on: 1273685
Will we switch to 64-bit Firefox whenever a user runs a 64-bit OS, or can we make that happen?
(In reply to Ting-Yu Chou [:ting] from comment #29)
> I was wondering why there are so many tiny (<1M) blocks. Using
> minidump-memorylist, I saw that many regions have memory protection
> PAGE_WRITECOMBINE and type MD_MEMORY_TYPE_MAPPED; I am not sure what they are for:

I placed breakpoints at VirtualAlloc* to see if they are hit with PAGE_WRITECOMBINE set, but no luck. I don't know; the regions could be allocated by drivers or something else. Another thing I noticed that could cause fragmentation is the random address allocation in js::jit::ExecutableAllocator::systemAlloc() [1], which is there for security; see bug 677272.

[1] https://dxr.mozilla.org/mozilla-central/rev/c4449eab07d39e20ea315603f1b1863eeed7dcfe/js/src/jit/ExecutableAllocatorWin.cpp#226
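
To illustrate why that randomization can scatter reservations, here is a simplified sketch (RandomHintedExecAlloc and its hint computation are made up for illustration; this is not the actual systemAlloc() code): each request tries a random address first, so executable regions end up spread across the address space and split what would otherwise be large contiguous free ranges.

  #include <windows.h>
  #include <cstdint>
  #include <cstdlib>

  void* RandomHintedExecAlloc(size_t bytes) {
    // Hypothetical: pick a random 64 KB-aligned address in the lower 2 GB.
    uintptr_t hint = (static_cast<uintptr_t>(rand()) << 16) & 0x7fff0000;
    void* p = VirtualAlloc(reinterpret_cast<void*>(hint), bytes,
                           MEM_RESERVE | MEM_COMMIT, PAGE_EXECUTE_READWRITE);
    if (!p) {
      // Fall back to letting the OS pick an address; the randomized
      // attempts still leave scattered reservations over time.
      p = VirtualAlloc(nullptr, bytes, MEM_RESERVE | MEM_COMMIT,
                       PAGE_EXECUTE_READWRITE);
    }
    return p;
  }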
No longer depends on: 1273685
Depends on: 1274706
No longer depends on: 1274706
Removing some fairly generic OOM signatures from blocking this bug.
No longer depends on: 1229384, 1257387, 1263916
Priority: P1 → --
I think we can call this fixed, though of course there are still remaining memory improvements that could be made.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla48