Closed Bug 1259512 (e10s-oom) Opened 8 years ago Closed 8 years ago

[e10s] significantly higher rates of OOM crashes in the content process of Firefox with e10s than in the main process of non-e10s

Categories

(Core :: DOM: Content Processes, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

Status: RESOLVED FIXED
Target Milestone: mozilla48
Tracking Status: e10s + ---

People

(Reporter: benjamin, Assigned: mccr8)

References

(Depends on 1 open bug)

Details

(Whiteboard: [MemShrink:meta] btpp-active)

Attachments

(1 file)

Based on the recent e10s experiments on beta, there is a much higher incidence of OOM crashes with e10s enabled than with e10s disabled. It's silly to file a separate bug for each OOM crash signature, so instead I'm filing this single bug.

Our stability problems are primarily in the content process, not in the chrome process.

Analysis of e10s crash rates:
https://github.com/vitillo/e10s_analyses/blob/master/beta46-noapz/e10s-stability-analysis.ipynb

Breakdown of signatures that affect e10s:
https://gist.github.com/bsmedberg/f23e84ae4021a1cc3bcf

There are many things I don't know yet about this problem:
* Platforms affected
* Are we running out of real memory (physical+swap) or VM? Or just fragmenting our address space to death?
* Does this primarily affect certain subgroups, such as people with acceleration on/off or certain graphics drivers?

Running out of real memory could hopefully be figured out with about:memory.

Running out of VM might be caused by a bug in mapping shared memory sections across the IPC boundary, or other bugs (graphics drivers have caused problems like this in the past).

Related bugs:
bug 1250672 OSX does not reclaim memory properly
bug 1257486 - add more memory annotations to content process crash reports
bug 1236108 and followup bug 1256541 - make OOMAllocationSize work in content process crash reports
bug 1259358 - nsITimer sometimes doesn't work with e10s (could be causing GC/CC scheduling problems?)

Please DO dup any small-OOM crash signature bugs which are specific to e10s here.
Please DO NOT dup large-OOM crash signatures to this bug.
Alias: e10s-oom
Whiteboard: [MemShrink]
Thanks for filing this. I noticed myself yesterday that something might be awry here. For instance, in the control group there are 124 crashes in mozilla::CycleCollectedJSRuntime::CycleCollectedJSRuntime, but with e10s there are 192, which seems like an alarming difference.

The only bad e10s-specific leak (rather than just bloat) I'm aware of is bug 1252677, which looks like some kind of Windows-specific SharedMemory/PTextureChild leak. However, Bas said the leaking tests are related to "drawing video to a Canvas", which doesn't sound like something you'd see very often in regular web browsing.
Yeah, it's worse than 124/192, because content process crashes have a 10% submission rate while chrome process crashes have a 50% submission rate.
Depends on: 1256541
Depends on: 1257486
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #4)
> Yeah, it's worse than 124/192, because content process crashes have a 10%
> submission rate while chrome process crashes have a 50% submission rate.
Without knowing how many crashes go unreported at shutdown, it is hard to gauge whether there is really such a big change. To me the reported signatures seem about the same (though that is just from ad hoc viewing; startup crashes get reported more often).

js::AutoEnterOOMUnsafeRegion::crash seems to be the big increase. (The data is messy due to a signature change.)
mozilla::CycleCollectedJSRuntime::CycleCollectedJSRuntime is at only 46 (with any luck the fix is bug 1247122).

Beta 45 vs. beta 46 e10s comparison:
https://crash-analysis.mozilla.com/rkaiser/datil/searchcompare/?common=process_type%3Dbrowser%26process_type%3Dcontent%26ActiveExperimentBranch%3D%253Dexperiment-no-addons&p1=ActiveExperiment%3D%253De10s-beta45-withoutaddons%2540experiments.mozilla.org%26date%3D%3E%253D2016-02-12%26date%3D%3C2016-02-25&p2=ActiveExperiment%3D%253De10s-beta46-noapz%2540experiments.mozilla.org%26date%3D%3E%253D2016-03-11%26date%3D%3C2016-03-25
Depends on: 1259187
Depends on: 1259183
Depends on: 1235633
Depends on: 1257387
tracking-e10s: --- → +
Priority: -- → P1
Could IPC be making heap fragmentation more serious? If so, things like 1) what :billm is doing in bug 1235633 comment 12 to avoid a copy, or 2) reusing message buffers might help.
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #0)
> Based on the recent e10s experiments on beta, there is a much higher
> incidence of OOM crashes with e10s enabled than with e10s disabled.

Is this true? With e10s disabled, "OOM | small" accounts for about 7% of all crashes in beta 46. With e10s enabled, it's 0.55%. Granted, there are some specific OOM signatures in e10s that are significant. But it still seems like e10s has fewer OOMs overall. Am I misreading the data? I know that the way the crash annotations work is really complex.
(In reply to Bill McCloskey (:billm) from comment #9)
> Am I misreading the data?

Bug 1256541 only landed on beta yesterday, so we do not have crash signatures with proper OOM annotations for content processes from beta yet. To get an approximation of what OOM | small would be, you have to add up the various OOM | unknown signatures, like those in the two duped bugs.
(In reply to Ting-Yu Chou [:ting] from comment #8)
> Could IPC be making heap fragmentation more serious?

Yes, that is the best guess I have so far. However, telemetry suggests that VSIZE_MAX_CONTIGUOUS is maybe a little better with e10s enabled, which contradicts that theory[1]. VSIZE_MAX_CONTIGUOUS is the largest contiguous amount of address space in the process, and tends to get really low when there is severe heap fragmentation. It does seem like it is always fairly low for some people, around 800kb in the chart, so maybe many users are just in a precarious state that results in a crash when IPC ends up allocating a large block of memory.

[1] https://github.com/vitillo/e10s_analyses/blob/master/beta45-withoutaddons/e10s_experiment.ipynb
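
For illustration, here is a minimal sketch of what measuring the largest contiguous free block on Windows involves: walking the address space with VirtualQuery and taking the biggest MEM_FREE region. This is just a standalone example of the concept, not the actual telemetry probe behind VSIZE_MAX_CONTIGUOUS.

  #include <windows.h>
  #include <cstdio>

  // Walk the whole address space and report the largest free region.
  // A single allocation bigger than this value fails even if plenty of
  // scattered free pages remain, which is what "fragmenting to death"
  // looks like.
  int main() {
    SIZE_T largestFree = 0;
    MEMORY_BASIC_INFORMATION info;
    char* addr = nullptr;
    while (VirtualQuery(addr, &info, sizeof(info)) == sizeof(info)) {
      if (info.State == MEM_FREE && info.RegionSize > largestFree) {
        largestFree = info.RegionSize;
      }
      addr = static_cast<char*>(info.BaseAddress) + info.RegionSize;
    }
    printf("largest contiguous free block: %zu bytes\n", largestFree);
    return 0;
  }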

I think we can analyze crash minidumps to figure out what exactly the address space looks like. There was some previous work along these lines for desktop Firefox in bug 1001760, and hopefully we can reuse those analyses.

> If so, things like 1) what :billm is doing in bug 1235633 comment 12 to avoid
> a copy, or 2) reusing message buffers might help.

Yes, I think those are worth trying. Also, the Pickle::Resize() method currently uses a doubling strategy, which could be bad if the size is nearing that of our largest contiguous block. There's a bug on file for it but I don't remember which.
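
To make the concern concrete, here is a hypothetical sketch of the growth strategies (NextCapacityDoubling and NextCapacityCapped are made-up helpers, and the 8 MB threshold is an assumed value; this is not the real Pickle code or the fix proposed elsewhere): with pure doubling, the single allocation request can overshoot the needed size by almost 2x, which matters when the largest free block is barely big enough for the data itself.

  #include <cstddef>

  // Pure doubling: e.g. a 33 MB message forces a 64 MB request.
  size_t NextCapacityDoubling(size_t current, size_t needed) {
    size_t cap = current ? current : 64;
    while (cap < needed) {
      cap *= 2;
    }
    return cap;
  }

  // One possible alternative: double up to a threshold, then grow in
  // fixed increments so the request stays close to the actual size.
  size_t NextCapacityCapped(size_t current, size_t needed) {
    const size_t kThreshold = 8 * 1024 * 1024;  // assumed value
    size_t cap = current ? current : 64;
    while (cap < needed) {
      cap = (cap < kThreshold) ? cap * 2 : cap + kThreshold;
    }
    return cap;
  }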
> Also, the Pickle::Resize() method
> currently uses a doubling strategy, which could be bad if the size is
> nearing that of our largest contiguous block. There's a bug on file for it
> but I don't remember which.

It's bug 1253131. I said I would take a look but I haven't got around to it yet. I'll probably get to it, maybe early next week, though I'd be happy if someone else took a look in the meantime.
I was wondering whether there is any information jemalloc can show us, but then realized that features like profiling [1] and measuring external fragmentation (stats.active) [2] do not exist in mozjemalloc.

The best I can get is:

  ___ Begin malloc statistics ___
  Assertions disabled
  Boolean MALLOC_OPTIONS: aCjPz
  Max arenas: 1
  Pointer size: 8
  Quantum size: 16
  Max small size: 512
  Max dirty pages per arena: 256
  Chunk size: 1048576 (2^20)
  Allocated: 47656384, mapped: 110100480
  huge: nmalloc      ndalloc    allocated
             27           26      2097152

  arenas[0]:
  dirty: 136 pages dirty, 204 sweeps, 4659 madvises, 38082 pages purged
              allocated      nmalloc      ndalloc
  small:       31968704      1411295      1132137
  large:       13590528        33356        31853
  total:       45559232      1444651      1163990
  mapped:     106954752
  bins:     bin   size regs pgs  requests   newruns    reruns maxruns curruns
              0 T    8  500   1     76882        43       422      21      21
              1 Q   16  252   1    222439       543      4460     276     143
              2 Q   32  126   1    320891      1608     10905     985     766
              3 Q   48   84   1    145888      1020      7771     652     568

Some notes:

  - it seems no one is actively working on bug 762449
  - [3] mentions "virtual memory fragmentation", which sounds like the VSIZE_MAX_CONTIGUOUS metric :mccr8 mentioned in comment 11

[1] https://github.com/jemalloc/jemalloc/wiki/Use-Case:-Heap-Profiling
[2] http://blog.gmane.org/gmane.comp.lib.jemalloc/month=20140501
[3] http://www.canonware.com/pipermail/jemalloc-discuss/2013-April/000572.html
(In reply to Andrew McCreight [:mccr8] from comment #11)
> I think we can analyze crash minidumps to figure out what exactly the
> address space looks like. There was some previous work along these lines for
> desktop Firefox in bug 1001760, and hopefully we can reuse those analyses.

I tried the minidump-memorylist tool with the latest Google Breakpad, and it crashed. Somehow GetMemoryInfoList() [1] returns null. :(

[1] https://github.com/bsmedberg/minidump-memorylist/blob/master/minidump-memorylist.cc#L66
> profiling [1]

You can use DMD's "live" mode to do generic heap profiling. See the docs at https://developer.mozilla.org/en-US/docs/Mozilla/Performance/DMD
Depends on: 1253131
Depends on: 1238707
(In reply to Ting-Yu Chou [:ting] from comment #14)
> (In reply to Andrew McCreight [:mccr8] from comment #11)
> > I think we can analyze crash minidumps to figure out what exactly the
> > address space looks like. There was some previous work along these lines for
> > desktop Firefox in bug 1001760, and hopefully we can reuse those analyses.
> 
> I tried the tool minidump-memorylist with the latest google breakpad, and it
> crashed. Somehow GetMemoryInfoList() [1] returns null. :(
> 
> [1]
> https://github.com/bsmedberg/minidump-memorylist/blob/master/minidump-
> memorylist.cc#L66

This only works on dumps produced on Windows, FYI. We could make it work for Linux as well (Linux minidumps include /proc/self/maps). Mac minidumps do not include memory mapping info, IIRC.
Assignee: nobody → continuation
Whiteboard: [MemShrink] → [MemShrink:meta]
Depends on: 1262661
Depends on: 1262918
No longer depends on: 1262661
No longer depends on: 1253131
Whiteboard: [MemShrink:meta] → [MemShrink:meta] btpp-active
Depends on: 1263774
Depends on: 1259480
Depends on: 1263916
Depends on: 1229384
No longer depends on: 1259480
:mccr8, is there anything I can help with? I am not sure which bugs you are not working on and which have higher priority.
Flags: needinfo?(continuation)
(In reply to Ting-Yu Chou [:ting] from comment #17)
> :mccr8, is there anything I can help with? I am not sure which bugs you are
> not working on and which have higher priority.

I'm only working on bug 1253131 right now (bug 1263235 is just waiting for a review). I'm not really sure what the priority should be for the various bugs. We still don't have a great idea of what is causing OOMs, aside from large messages likely causing problems with contiguous address space. Bug 1262671 would be good to have, but it may take a little while, so something shorter term might be better right now. I'm not sure.
Flags: needinfo?(continuation)
(In reply to Andrew McCreight [:mccr8] from comment #18)
> large messages likely causing problems with contiguous address space. Bug
> 1262671 would be good to have, but it may take a little while, so something
> shorter term might be better right now. I'm not sure.

Then I'll see if I can get some data that :dmajor did in bug 1001760.
minidump-memorylist crashed because GetMemoryInfoList() returns null. The message:

  2016-04-13 16:58:58: minidump.cc:4765: INFO: GetStream: type 16 not present

means it can not find the list of information about mapped memory regions for a process from the dump file.
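
For reference, a sketch of the kind of guard the tool needs, using the google_breakpad Minidump API that the linked minidump-memorylist.cc relies on (DumpMemoryInfo is a hypothetical wrapper, and the exact accessors should be checked against the Breakpad headers): bail out when the MINIDUMP_MEMORY_INFO_LIST stream (type 16) is absent instead of dereferencing a null list.

  #include <cstdio>
  #include "google_breakpad/processor/minidump.h"

  int DumpMemoryInfo(const char* path) {
    google_breakpad::Minidump dump(path);
    if (!dump.Read()) {
      fprintf(stderr, "failed to read %s\n", path);
      return 1;
    }
    google_breakpad::MinidumpMemoryInfoList* list = dump.GetMemoryInfoList();
    if (!list) {
      // Non-Windows dumps (and dumps missing the stream) end up here.
      fprintf(stderr, "%s has no memory info list\n", path);
      return 1;
    }
    for (unsigned int i = 0; i < list->info_count(); ++i) {
      const google_breakpad::MinidumpMemoryInfo* region =
          list->GetMemoryInfoAtIndex(i);
      printf("base 0x%llx size 0x%llx\n",
             static_cast<unsigned long long>(region->GetBase()),
             static_cast<unsigned long long>(region->GetSize()));
    }
    return 0;
  }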
(In reply to Ting-Yu Chou [:ting] from comment #20)
> minidump-memorylist crashed because GetMemoryInfoList() returns null. The
> message:
> 
>   2016-04-13 16:58:58: minidump.cc:4765: INFO: GetStream: type 16 not present
> 
> means it can not find the list of information about mapped memory regions
> for a process from the dump file.

Was this dump from a Windows system? If not, see comment 16.
Great, I was looking into what MiniDumpNormal dumps. Would you fix it?
I'll fix this, it should be a straightforward patch.
No, that's for "memory reports", which are the content of about:memory. The "memory info stream" is bug 1264242. (These are confusingly similar, aren't they?)
Depends on: 1110596
Depends on: 1263028
Depends on: 1265015
Depends on: 1265902
Depends on: 1264161
Depends on: 1266517
Depends on: 1267329
Depends on: 1269365
Attached file oomdata.txt
Analysed the minidumps from 47.0b1 that have the crash signature [@ OOM | small] in the content process. See bug 1001760 comment 4 for the meaning of each field.

Basically it suggests bug 1005844 would be helpful.
jimm noticed that a lot of the OOM small crashes are happening in IPC code: http://tinyurl.com/hxkcs79
(If you add "proto signature" as a facet to your super search then it shows the actual signatures.)

It might be that this is because memory usage is higher while we are in the middle of dealing with IPC, due to all of the allocations needed to serialize and deserialize. If we could reduce that memory spike, we might be able to reduce the number of OOM crashes.
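
To make the suspected spike concrete, a tiny hypothetical example (the Message and Serialize names are made up for illustration; this is not the real IPDL or Pickle code): while a large message is being built, the source data and the serialized copy are both live, so peak usage is roughly double the payload size until the send completes.

  #include <string>
  #include <vector>

  // Hypothetical message type: just a flat buffer of serialized bytes.
  struct Message {
    std::vector<char> payload;
  };

  Message Serialize(const std::string& bigData) {
    Message msg;
    // Both bigData and msg.payload are alive here, so roughly
    // 2 * bigData.size() bytes are committed at the peak.
    msg.payload.assign(bigData.begin(), bigData.end());
    return msg;
  }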
Depends on: 1272018
(In reply to Ting-Yu Chou [:ting] from comment #27)
> Created attachment 8749567 [details]
> oomdata.txt

I was wondering why there are so many tiny (<1M) blocks. Using minidump-memorylist, I saw that many regions have memory protection PAGE_WRITECOMBINE and type MD_MEMORY_TYPE_MAPPED; I am not sure what they are for:

BaseAddress  AllocationBase  AllocationProtect  RegionSize  State  Protect  Type
7fd60000     7fd60000        404                8000        1000   404      40000
7fd68000     0               0                  8000        10000  1        0
7fd70000     7fd70000        404                8000        1000   404      40000
7fd78000     0               0                  8000        10000  1        0
7fd80000     7fd80000        404                1c000       1000   404      40000
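
For anyone decoding the raw hex values: they are the standard VirtualQuery flags from <winnt.h>, e.g. Protect 404 is PAGE_READWRITE | PAGE_WRITECOMBINE, State 1000 is MEM_COMMIT, 10000 is MEM_FREE, and Type 40000 is MEM_MAPPED. A small illustrative helper (DescribeRegion is not part of minidump-memorylist):

  #include <windows.h>
  #include <cstdio>

  // Decode the Protect/State/Type columns above using <winnt.h> constants.
  void DescribeRegion(DWORD protect, DWORD state, DWORD type) {
    printf("write-combine: %s, ", (protect & PAGE_WRITECOMBINE) ? "yes" : "no");
    printf("state: %s, ", state == MEM_COMMIT ? "committed"
                        : state == MEM_FREE ? "free"
                        : state == MEM_RESERVE ? "reserved" : "?");
    printf("type: %s\n", type == MEM_MAPPED ? "mapped"
                       : type == MEM_PRIVATE ? "private"
                       : type == MEM_IMAGE ? "image" : "?");
  }

  // Example: the first row decodes as write-combine, committed, mapped.
  // DescribeRegion(0x404, 0x1000, 0x40000);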
Depends on: 1273685
Will we switch to 64-bit Firefox whenever a user runs a 64-bit OS, or can we make that happen?
(In reply to Ting-Yu Chou [:ting] from comment #29)
> I was wondering why there are so many tiny (<1M) blocks. Using
> minidump-memorylist, I saw that many regions have memory protection
> PAGE_WRITECOMBINE and type MD_MEMORY_TYPE_MAPPED; I am not sure what they are for:

I placed breakpoints at VirtualAlloc* to see if they are hit with PAGE_WRITECOMBINE set, but no luck. I don't know; the regions could be allocated by drivers or something else. Another thing I noticed that could cause fragmentation is the random address allocation in js::jit::ExecutableAllocator::systemAlloc() [1], which is there for security; see bug 677272.

[1] https://dxr.mozilla.org/mozilla-central/rev/c4449eab07d39e20ea315603f1b1863eeed7dcfe/js/src/jit/ExecutableAllocatorWin.cpp#226
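
To illustrate why that randomization can scatter reservations, here is a simplified sketch (RandomHintedExecAlloc and its hint computation are made up for illustration; this is not the actual systemAlloc() code): each request tries a random address first, so executable regions end up spread across the address space and split what would otherwise be large contiguous free ranges.

  #include <windows.h>
  #include <cstdint>
  #include <cstdlib>

  void* RandomHintedExecAlloc(size_t bytes) {
    // Hypothetical: pick a random 64 KB-aligned address in the lower 2 GB.
    uintptr_t hint = (static_cast<uintptr_t>(rand()) << 16) & 0x7fff0000;
    void* p = VirtualAlloc(reinterpret_cast<void*>(hint), bytes,
                           MEM_RESERVE | MEM_COMMIT, PAGE_EXECUTE_READWRITE);
    if (!p) {
      // Fall back to letting the OS pick an address; the randomized
      // attempts still leave scattered reservations over time.
      p = VirtualAlloc(nullptr, bytes, MEM_RESERVE | MEM_COMMIT,
                       PAGE_EXECUTE_READWRITE);
    }
    return p;
  }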
No longer depends on: 1273685
Depends on: 1274706
No longer depends on: 1274706
Removing some fairly generic OOM signatures from blocking this bug.
No longer depends on: 1229384, 1257387, 1263916
Priority: P1 → --
I think we can call this fixed, though of course there are still remaining memory improvements that could be made.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla48