Closed Bug 1219914 Opened 9 years ago Closed 7 years ago

25MiB AWSY regression when re-enabling jemalloc 4

Categories

(Core :: Memory Allocator, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

Status: RESOLVED WONTFIX
Tracking Status: firefox45 --- affected

People

(Reporter: erahm, Unassigned)

References

(Blocks 1 open bug)

Details

(Whiteboard: [MemShrink:P1])

Attachments

(1 file, 1 obsolete file)

When jemalloc 4 was re-enabled on 9/23 we saw a 25MiB jump on AWSY. At the time glandium thought it must have been due to changes to jemalloc 4 between disabling it on 9/17 and re-enabling it on 9/23.
Mike, did we ever track down what changed between disabling and re-enabling jemalloc 4 last time?
Flags: needinfo?(mh+mozilla)
I am waiting on this to be sorted out before enabling jemalloc4 again on trunk.
Whiteboard: [MemShrink] → [MemShrink:P1]
The memory report diff between these two revisions

  387.06MiB https://hg.mozilla.org/integration/mozilla-inbound/rev/e892727a373a
  420.81MiB https://hg.mozilla.org/integration/mozilla-inbound/rev/81fca7e4e6ef

on AWSY ("RSS: After TP5, tabs closed"):

  33.21 MB (100.0%) -- explicit
  ├──29.90 MB (90.03%) -- heap-overhead
  │  ├──28.55 MB (85.99%) ── page-cache
  │  ├───4.63 MB (13.94%) ── bookkeeping
  │  └──-3.29 MB (-9.90%) ── bin-unused
  ├───6.73 MB (20.28%) ── heap-unclassified
  ├──-2.08 MB (-6.27%) -- js-non-window
According to https://dxr.mozilla.org/mozilla-central/source/xpcom/base/nsMemoryReporterManager.cpp#1351, page-cache is not something we need to worry about. Then the problem would be heap-unclassified (which I missed in comment 3):

33.21 MB (100.0%) -- explicit
├──29.90 MB (90.03%) -- heap-overhead
│  ├──28.55 MB (85.99%) ── page-cache
│  ├───4.63 MB (13.94%) ── bookkeeping
│  └──-3.29 MB (-9.90%) ── bin-unused
├───6.73 MB (20.28%) ── heap-unclassified
├──-2.08 MB (-6.27%) -- js-non-window
├──-1.08 MB (-3.27%) -- images
├──-0.63 MB (-1.89%) ++ (14 tiny)
└───0.37 MB (01.12%) -- gfx
I tried to test locally to see if I could find out what's in heap-unclassified, but I got a smaller heap-unclassified with 81fca7e4e6ef:

1. open 10 tabs
2. let tab 1 load the first website in tp5n, tab 2 the second, and so on, going back to tab 1 for the eleventh...
3. open a new tab for about:memory
4. minimize memory usage, and get a memory report
5. save the DMD report

Will try again with the benchtester in areweslimyet.
(In reply to Ting-Yu Chou [:ting] from comment #5)
> Will try again with the benchtester in areweslimyet.

Couldn't find a way to launch the Firefox binary with DMD via run_slimtest.py...
(In reply to Ting-Yu Chou [:ting] from comment #6)
> Couldn't find a way to launch the Firefox binary with DMD via run_slimtest.py...

OK, the following enables it:

  export LD_PRELOAD=$OBJDIR/dist/bin/libdmd.so
  export LD_LIBRARY_PATH=$OBJDIR/dist/bin/
  export DMD=1
Another problem: I don't know how to keep the browser open after the test finishes, so I have time to save the DMD report...
(In reply to Ting-Yu Chou [:ting] from comment #8)
> Another problem: I don't know how to keep the browser open after the test
> finishes, so I have time to save the DMD report...

I added "os.system("killall -34 firefox")" to test_open_tabs().
Note that since this bug was filed, jemalloc 4 was upgraded. So we should get new AWSY numbers.
Flags: needinfo?(mh+mozilla)
(In reply to Ting-Yu Chou [:ting] from comment #11)
> I will enqueue the build to AWSY later.

Enqueued. https://areweslimyet.com/?series=bug1219914
Ting-Yu, thank you for picking this back up. Let me know if you need any more help with AWSY.

(In reply to Ting-Yu Chou [:ting] from comment #4)
> According to
> https://dxr.mozilla.org/mozilla-central/source/xpcom/base/nsMemoryReporterManager.cpp#1351,
> page-cache is not something we need to worry about. Then the problem would be
> heap-unclassified (which I missed in comment 3):

Page cache is definitely a problem. njn just landed patches reducing it in mozjemalloc in bug 1258257.
Got it, thanks Eric! I am taking sick leave today, will resume the work tomorrow.
(In reply to Eric Rahm [:erahm] from comment #13)
> Page cache is definitely a problem. njn just landed patches reducing it in
> mozjemalloc in bug 1258257.

Eric, could you explain a bit why page cache is a problem if it's just kept around as an optimization? Thanks.
Flags: needinfo?(erahm)
(In reply to Ting-Yu Chou [:ting] from comment #15)
> (In reply to Eric Rahm [:erahm] from comment #13)
> > Page cache is definitely a problem. njn just landed patches reducing it in
> > mozjemalloc in bug 1258257.
> 
> Eric, could you explain a bit why page cache is a problem if it's just
> kept around as an optimization? Thanks.

It holds onto a significant portion of unused memory that could otherwise be returned to the system. This is supposed to be a speed optimization, but after shrinking mozjemalloc's page cache we did not see an impact on our performance tests.

As we start enabling more content processes this overhead becomes much more significant (25MiB per process).
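To make "returned to the system" concrete: page-cache is memory the allocator has already freed internally but is still holding as dirty pages, so it keeps counting against RSS until it is purged. On Linux the purge boils down to an madvise call, roughly like this (an illustrative sketch, not Firefox code):

  #include <sys/mman.h>
  #include <stddef.h>

  /* Purge a range of dirty (freed-but-retained) pages: the mapping stays
   * valid, but the kernel may reclaim the physical pages, so they stop
   * counting against the process's RSS. */
  static int purge_dirty_pages(void *addr, size_t length)
  {
      return madvise(addr, length, MADV_DONTNEED);
  }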
Flags: needinfo?(erahm)
This is the diff (TabsOpenForceGC) between not enabling and enabling jemalloc4 [1]:

43.15 MB (100.0%) -- explicit
├──35.67 MB (82.65%) -- heap-overhead
│  ├──33.91 MB (78.57%) ── page-cache
│  ├───3.64 MB (08.45%) ── bookkeeping
│  └──-1.88 MB (-4.36%) ── bin-unused
├───7.14 MB (16.56%) ── heap-unclassified
├───1.07 MB (02.48%) ++ js-non-window
├──-1.29 MB (-2.98%) ++ images
├───1.03 MB (02.38%) ++ gfx
└──-0.46 MB (-1.08%) ++ (16 tiny)

[1] https://areweslimyet.com/?series=bug1219914
For jemalloc4:

  https://dxr.mozilla.org/mozilla-central/source/memory/jemalloc/src/src/arena.c#1407-1418

It seems page-cache is controlled by opt_lg_dirty_mult, which defaults to 3, and arena_maybe_purge_ratio() makes sure there are no more than max(arena->nactive >> 3, chunk_npages) dirty pages. So by default each arena is allowed to keep at least 2MB of dirty pages. If we set opt_lg_dirty_mult large enough, we end up with at most 2MB of dirty pages per arena (the chunk_npages floor). That still doesn't match bug 1258257, which caps it at 1MB.
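For reference, here is a minimal sketch of that ratio-based purge decision (simplified; the names and the 2MiB-chunk/4KiB-page assumption are mine, not the exact upstream code):

  #include <stddef.h>
  #include <stdbool.h>

  #define CHUNK_NPAGES 512  /* assumed: 2MiB chunks / 4KiB pages */

  /* Ratio mode: an arena may keep up to
   * max(nactive >> lg_dirty_mult, CHUNK_NPAGES) dirty pages before purging. */
  static bool arena_should_purge_ratio(size_t nactive, size_t ndirty,
                                       long lg_dirty_mult)
  {
      if (lg_dirty_mult < 0)
          return false;              /* ratio purging disabled */
      size_t threshold = nactive >> lg_dirty_mult;
      if (threshold < CHUNK_NPAGES)
          threshold = CHUNK_NPAGES;  /* floor: one chunk (~2MiB) of dirty pages */
      return ndirty > threshold;
  }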
(In reply to Ting-Yu Chou [:ting] from comment #17)
> ├───7.14 MB (16.56%) ── heap-unclassified

I tried to get a DMD report from a local AWSY test, but didn't see a similar diff in heap-unclassified...
Attached file poc
With this POC, which limits the page cache to 1MB, the explicit memory of TabsOpen / TabsOpenSettled with jemalloc4 enabled is smaller than with mozjemalloc, but TabsOpenForceGC and TabsClosed are larger. Note the page-cache numbers are all similar, as is heap-unclassified except in TabsClosed (a sketch of the hard-cap idea follows the reports below):

TabsOpen
  -4.58 MB (100.0%) -- explicit
  ├──-3.51 MB (76.56%) ++ js-non-window
  ├───3.23 MB (-70.44%) ++ heap-overhead
  ├──-3.28 MB (71.70%) ++ images
  ├──-1.32 MB (28.78%) ── heap-unclassified
  ├──-0.38 MB (08.27%) ++ window-objects
  ├───1.01 MB (-21.95%) ++ gfx
  ├──-0.33 MB (07.12%) ++ xpconnect
  ├───0.14 MB (-2.97%) ++ storage/sqlite
  ├───0.12 MB (-2.73%) ── preferences
  ├──-0.02 MB (00.43%) ++ add-ons
  ├──-0.09 MB (01.90%) ++ network
  ├──-0.08 MB (01.79%) ++ workers/workers(chrome)
  ├──-0.04 MB (00.97%) ++ layout
  └──-0.03 MB (00.58%) ++ (8 tiny)

TabsOpenSettled
  -1.62 MB (100.0%) -- explicit
  ├──-4.13 MB (254.16%) ++ js-non-window
  ├───3.32 MB (-204.14%) ++ heap-overhead
  ├──-1.02 MB (62.96%) ++ images
  ├───1.07 MB (-65.56%) ++ gfx
  ├──-0.46 MB (28.62%) ── heap-unclassified
  ├──-0.35 MB (21.74%) ++ xpconnect
  ├───0.14 MB (-8.37%) ++ storage/sqlite
  ├───0.13 MB (-7.69%) ── preferences
  ├──-0.06 MB (03.90%) ++ window-objects
  ├──-0.02 MB (01.21%) ++ add-ons
  ├──-0.09 MB (05.35%) ++ network
  ├──-0.08 MB (05.04%) ++ workers/workers(chrome)
  ├──-0.04 MB (02.73%) ++ layout
  └──-0.00 MB (00.06%) ++ (8 tiny)

TabsOpenForceGC
  7.04 MB (100.0%) -- explicit
  ├──6.96 MB (98.85%) -- heap-overhead
  │  ├──3.55 MB (50.41%) ── bookkeeping
  │  ├──3.24 MB (46.06%) ── bin-unused
  │  └──0.17 MB (02.38%) ── page-cache
  ├──1.26 MB (17.85%) ++ js-non-window
  ├──-1.28 MB (-18.22%) ++ images
  ├──1.03 MB (14.56%) ++ gfx
  ├──-0.45 MB (-6.33%) ── heap-unclassified
  ├──-0.29 MB (-4.08%) ++ xpconnect
  ├──-0.22 MB (-3.10%) ++ window-objects
  ├──0.13 MB (01.77%) ── preferences
  ├──-0.02 MB (-0.28%) ++ add-ons
  ├──-0.09 MB (-1.21%) ++ workers/workers(chrome)
  └──0.01 MB (00.18%) ++ (11 tiny)

TabsClosed
  17.83 MB (100.0%) -- explicit
  ├──34.26 MB (192.14%) ++ js-non-window
  ├──-23.74 MB (-133.15%) ── heap-unclassified
  ├───4.77 MB (26.77%) -- heap-overhead
  │   ├──3.59 MB (20.13%) ── bookkeeping
  │   ├──1.33 MB (07.45%) ── bin-unused
  │   └──-0.14 MB (-0.81%) ── page-cache
  ├───3.40 MB (19.04%) ++ window-objects
  ├──-1.13 MB (-6.32%) ++ images
  ├───0.41 MB (02.30%) ++ xpconnect
  └──-0.14 MB (-0.77%) ++ (15 tiny)
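For illustration, here is roughly what the hard cap amounts to, contrasted with the ratio mode sketched above (a minimal sketch only; the 1MiB constant matches the POC, but the names and page size are assumptions, not the attached patch):

  #include <stddef.h>
  #include <stdbool.h>

  #define PAGE_BYTES      4096                /* assumed page size */
  #define DIRTY_CAP_BYTES (1UL * 1024 * 1024) /* hard 1MiB page-cache limit */

  /* Hard-cap variant: purge whenever the dirty pages exceed a fixed byte
   * budget, independent of how much live (active) memory the arena holds. */
  static bool arena_should_purge_hard_cap(size_t ndirty_pages)
  {
      return ndirty_pages * PAGE_BYTES > DIRTY_CAP_BYTES;
  }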
Based on comment 20, I think it's worth making a PR to jemalloc so we can have a hard-limit mode for purging dirty pages. Then we can move on to enabling jemalloc 4.

What do you guys think?
Summary: 25MiB AWSY regression when re-enabling jemalloc 4 on 9/23 → 25MiB AWSY regression when re-enabling jemalloc 4
Here's the summary:

a. The 25MiB AWSY regression comes from page-cache; setting a hard 1MiB limit on dirty pages instead of ratio purging removes the regressed page-cache and heap-unclassified, see comment 20. [1]
b. Enabling jemalloc4 causes many regressions on Talos. [2]
c. Somehow enabling jemalloc4 broke the Windows XP and Windows 8 builds on Try. [3][4]

I guess jemalloc4 won't be enabled soon because of b); in that case we should land bug 1005844 in mozjemalloc first. We can file a follow-up to turn bug 1005844 into a hook later, when jemalloc4 gets enabled.

What do you think?

[1] https://areweslimyet.com/?series=bug1219914
[2] https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=093575f5e79c&newProject=try&newRevision=efa504ec15ae&framework=1&showOnlyImportant=0
[3] https://treeherder.mozilla.org/#/jobs?repo=try&revision=efa504ec15ae
[4] https://treeherder.mozilla.org/#/jobs?repo=try&revision=23e6b507ab8d
Flags: needinfo?(mh+mozilla)
Flags: needinfo?(erahm)
This bug has been open for 6 months and it looks deadlocked right now.
It may make more sense to evaluate the status of jemalloc4 first.

The question is whether jemalloc4 (bug 762449) offers sufficient benefits to move forward:
> if no, does that mean we stick with mozjemalloc from here on?
> if yes, investigate whether those Talos regressions are actionable
  > if yes, do we want to wait for those improvements?
  > if no, is the trade-off worth it?
(In reply to Ting-Yu Chou [:ting] from comment #23)
> c. Somehow enabling jemalloc4 broke the Windows XP and Windows 8 builds on Try. [3][4]

This may be due to bug 1261226. FWIW, I've been building w/ jemalloc4 enabled on Windows for some time now and things generally Just Work. Anyway, you can attempt another Try push with a newer upstream rev and see if it works better.
What do you think about comment 23?
Flags: needinfo?(continuation)
I don't really know anything about jemalloc, sorry. Hopefully Glandium will be able to reply.
Flags: needinfo?(continuation)
(In reply to Ting-Yu Chou [:ting] from comment #23)
> Here's the summary:
> 
> a. The 25MiB AWSY regression comes from page-cache; setting a hard 1MiB
> limit on dirty pages instead of ratio purging removes the regressed
> page-cache and heap-unclassified, see comment 20. [1]
> b. Enabling jemalloc4 causes many regressions on Talos. [2]
> c. Somehow enabling jemalloc4 broke the Windows XP and Windows 8 builds on Try. [3][4]
> 
> I guess jemalloc4 won't be enabled soon because of b); in that case we
> should land bug 1005844 in mozjemalloc first. We can file a follow-up to
> turn bug 1005844 into a hook later, when jemalloc4 gets enabled.
> 
> What do you think?
> 
> [1] https://areweslimyet.com/?series=bug1219914
> [2] https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=093575f5e79c&newProject=try&newRevision=efa504ec15ae&framework=1&showOnlyImportant=0
> [3] https://treeherder.mozilla.org/#/jobs?repo=try&revision=efa504ec15ae
> [4] https://treeherder.mozilla.org/#/jobs?repo=try&revision=23e6b507ab8d

FWIW I think trying to land bug 1005844 is a good idea. It might make sense to have jmaher look at those talos results to weigh in on whether or not they're acceptable. It's also possible glandium can work his jemalloc-regression-hunting magic to find issues in jemalloc itself.
Flags: needinfo?(erahm) → needinfo?(jmaher)
it would be nice to get updated try pushes if possible - that push is 5+ weeks old and the code base changes often.

I don't like so many regressions - while they are all <5%, a lot of small regressions pop up, and they hit e10s more than non-e10s. I would want to coordinate with the e10s team, as they have worked hard to get e10s <= non-e10s.

let me know if we can get new try pushes, that would be nice to have.
Flags: needinfo?(jmaher)
Not sure how to enable jemalloc4 now...
ac_add_options --enable-jemalloc=4
(In reply to Ryan VanderMeulen [:RyanVM] from comment #34)
> ac_add_options --enable-jemalloc=4

Tried, but didn't see MOZ_JEMALLOC4 in obj-x86_64-pc-linux-gnu/config.status, is this normal?
hrm, it's definitely there on my local builds.
(In reply to Ting-Yu Chou [:ting] from comment #35)
> Tried, but didn't see MOZ_JEMALLOC4 in
> obj-x86_64-pc-linux-gnu/config.status, is this normal?

Hmm... it does not have MOZ_JEMALLOC4 if I put the option in

  build/mozconfig.common.override, or
  build/mozconfig.common

but it does if I put it in .mozconfig. How should I do this for a Try push?
Add "ac_add_options --enable-jemalloc=4" to build/mozconfig.common.override will have jemalloc4 enabled on Try but not local build. Anyway, here's the updated comparison:

https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=85592c49e70b&newProject=try&newRevision=c96f8c0e8343&framework=1&showOnlyImportant=0

Note the tests are still running as I post this, and sorry for spamming Try pushes.
Flags: needinfo?(jmaher)
given the compare link above, we have a lot of useful data to work with (linux64, osx10.10, win7).  There are a lot of regressions and a few improvements.

improvements:
2-3% linux64: glterrain, kraken, kraken-e10s, main_rss, tp5o private bytes, tps
203% osx: tp5o

regressions - too many to list, just look at the compare link above. It seems win7 is the primary platform for regressions; many are in the 2-5% range. Here are the largest regressions seen:
>7% win7: a11y, damp, cart, sessionrestore_no_auto_restore, tart, tp5 responsiveness, tp5o scroll
>7% linux64: tabpaint, tp5o scroll
>7% osx10: tps!!!

I recall a large list of regressions when this was landed originally. I assume there is nothing we can do to reduce these regressions, but it would be good to understand them in more detail before moving forward (e.g. is this a relic of the hardware we use, build flags, etc.?).
Flags: needinfo?(jmaher)
(In reply to Joel Maher (:jmaher) from comment #29)
> a lot of small regressions pop up, and they hit e10s more than non-e10s. I would want to
> coordinate with the e10s team, as they have worked hard to get e10s <= non-e10s.

Would you coordinate with the team based on the updated comparison?
Flags: needinfo?(jmaher)
Jemalloc 4.2.1 is out and, as preliminary testing seems to indicate, is much better than what we currently have in tree, enough that I'm seriously considering enabling it and seeing what happens after bug 1277704 lands.
Depends on: 1277704
Flags: needinfo?(mh+mozilla)
I have let this needinfo go way too long. I did chat with the e10s team and they don't have release criteria for anything after version 48, so in this case let's look at 4.2.1 and see if we can make something stick.
Flags: needinfo?(jmaher)
In case anybody is wondering, here's the current state of Talos with and without jemalloc4 (now 4.4.0) enabled. Not sure how easily we can run a test on AWSY.

https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=4a2697532c50&newProject=try&newRevision=3d8ac051ac1e&framework=1&showOnlyImportant=0
Testing locally, I've found that setting |lg_dirty_mult:6| (using |JE_MALLOC_CONF|) has approximately the same effect as :ting found with explicitly setting |threshold|: page-cache goes from ~20MB down to 1-2MB.

This seems like a good compromise, since for very large processes it'll still allow gradually keeping more pages around, and it can be solved with configuration instead of needing to mess with the source. Once bug #1353752 is solved, I'll work on a patch to configure our default lg_dirty_mult to be 6, and then run through AWSY and confirm my local measurements.
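As a quick sanity check that the MALLOC_CONF/JE_MALLOC_CONF value was picked up, the effective option can be read back through mallctl, roughly like this (a sketch only; the je_ prefix and header path are assumptions about how the in-tree jemalloc 4 is built):

  #include <stdio.h>
  #include <sys/types.h>
  #include <jemalloc/jemalloc.h>

  int main(void)
  {
      ssize_t lg_dirty_mult;
      size_t len = sizeof(lg_dirty_mult);

      /* "opt.lg_dirty_mult" reports the value jemalloc parsed at startup
       * (from MALLOC_CONF / the malloc_conf string). */
      if (je_mallctl("opt.lg_dirty_mult", &lg_dirty_mult, &len, NULL, 0) == 0)
          printf("lg_dirty_mult = %zd\n", lg_dirty_mult);
      else
          printf("opt.lg_dirty_mult not available in this build\n");
      return 0;
  }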
Attachment #8855021 - Flags: review?(mh+mozilla)
Alex, can you run this against AWSY? It would be nice to get several retriggers so we can compare in Perfherder. The try syntax for all supported platforms is:

> try: -b o -p linux32,linux64,win32,win64 -u awsy-e10s -t none

Before pushing I think you'll need to update the in-tree configs to enable jemalloc 4.
Flags: needinfo?(agaynor)
Eric, yes I will, but first it's blocked on bug #1353752 (I ended up working out of order with what I said after I realized Mike was on Tokyo time :D). Once that's landed I'll kick the try bots. (Leaving ni on since the next action here is still on me)
(In reply to Alex Gaynor from comment #49)
> Eric, yes I will, but first it's blocked on bug #1353752 (I ended up working
> out of order with what I said after I realized Mike was on Tokyo time :D).
> Once that's landed I'll kick the try bots. (Leaving ni on since the next
> action here is still on me)

Oh right, you're using review board and autoland. I can push to try for you, I'll add a note here when I do.
Thanks!
Flags: needinfo?(agaynor)
Comment on attachment 8855021 [details]
Bug 1219914 - increase jemalloc's 'lg_dirty_mult' parameter from the default of 3 to 6

https://reviewboard.mozilla.org/r/126942/#review129700

Using jemalloc's default was done on purpose in bug 1203840, to avoid purges happening at relatively high frequency on free() and instead trigger them manually after cycle collection. Somehow, AWSY wasn't satisfied with this strategy, which is either a problem with when AWSY does its measurements, or a problem with when the purges are triggered. IIRC from when I looked, it tended to be the former. It would be good for fresh eyes to look into it again.
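For context, the "purge manually after cycle collection" strategy amounts to something like the following hook (a hypothetical sketch, not the actual Gecko code; the hook name is made up, and jemalloc_free_dirty_pages() is the mozjemalloc-style purge entry point):

  #include "mozmemory.h"  /* declares jemalloc_free_dirty_pages() */

  /* Hypothetical hook: once a cycle collection finishes, hand the allocator's
   * dirty pages back to the OS instead of relying on ratio purging on free(). */
  static void OnCycleCollectionDone(void)
  {
      jemalloc_free_dirty_pages();
  }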
Attachment #8855021 - Flags: review?(mh+mozilla)
AWSY does measurements immediately after testing (TabsOpen), 30 seconds later (TabsOpenSettled), and after forcing a GC (TabsOpenForceGC).
So for Linux we see an average of a 10MB regression in Explicit Memory After tabs open [+30s, forced GC] [1]. Windows seems happy (this is a new test, so I have lower confidence in it).

Interesting parts of the diff of memory reports [2,3]:

> Web Content (pid NNN)
> 
> ├──11.60 MB (278.77%) -- heap-overhead
> │  ├───9.12 MB (219.15%) ── bookkeeping [4]
> │  ├───1.72 MB (41.32%) ── bin-unused [4]
> │  └───0.76 MB (18.31%) ── page-cache [4]
> 
> Main Process (pid NNN)
> 
> ├──2.34 MB (93.88%) -- heap-overhead
> │  ├──1.64 MB (65.89%) ── bookkeeping
> │  ├──1.28 MB (51.43%) ── page-cache
> │  └──-0.58 MB (-23.44%) ── bin-unused

It looks like the bigger loss was in bookkeeping, so maybe jemalloc4 just has higher overhead in general?

[1] https://treeherder.mozilla.org/perf.html#/comparesubtest?originalProject=try&originalRevision=60d670e34950d5414175b06081e861c800df7e96&newProject=try&newRevision=91515528d453f08cd9fd28f552d1dd1b70645c2b&originalSignature=a630e0d4f7000610ca57d0f8da52b55d117632a9&newSignature=a630e0d4f7000610ca57d0f8da52b55d117632a9&framework=4
[2] https://queue.taskcluster.net/v1/task/ZpIEZEohSeagTdOVyw24dw/runs/0/artifacts/public/test_info//memory-report-TabsOpenForceGC-2.json.gz
[3] https://queue.taskcluster.net/v1/task/bbzJ2DiMSGCCkGqwAfPXPg/runs/0/artifacts/public/test_info//memory-report-TabsOpenForceGC-2.json.gz
Also note the original regression was single process, now we're using e10s-multi (1 chrome, 4 content), so relatively speaking this is *much* better.
I've gone ahead and pushed a build enabling jemalloc4 but without this patch:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=a840ec89de36478d526d4b1aaeac161143fdbd64
From the build with jemalloc4, but not this patch [0], I'm seeing that _one_ Web Content process has a high page-cache:

├───29.99 MB (22.80%) -- heap-overhead
│   ├──17.69 MB (13.45%) ── bin-unused
│   ├───8.13 MB (06.18%) ── page-cache
│   └───4.17 MB (03.17%) ── bookkeeping

however, the other processes (1 main, 3 web content) all have page-cache in the 0-2MB range. I'm still digging into how the AWSY tests work; is there any reason we'd expect such a significant deviation between the web content processes?

[0] https://queue.taskcluster.net/v1/task/PFTs4UwrTDWqFuE8--Gz4Q/runs/0/artifacts/public/test_info//memory-report-TabsOpenForceGC-2.json.gz
A few perfherder links:

base vs jemalloc4 (no patch): https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=22e2081b0e0d1e757e4e980c6ee6b29d21488c53&newProject=try&newRevision=a840ec89de36478d526d4b1aaeac161143fdbd64&framework=4&showOnlyImportant=0

jemalloc 4 (no patch) vs jemalloc 4 (with patch): https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=a840ec89de36478d526d4b1aaeac161143fdbd64&newProject=try&newRevision=91515528d453f08cd9fd28f552d1dd1b70645c2b&framework=4&showOnlyImportant=0

base vs jemalloc 4 (with patch): https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=22e2081b0e0d1e757e4e980c6ee6b29d21488c53&newProject=try&newRevision=91515528d453f08cd9fd28f552d1dd1b70645c2b&framework=4&showOnlyImportant=0


Summary:

base v jemalloc 4 (no patch): A handful of significant regressions with jemalloc4
jemalloc 4 (no patch) vs jemalloc 4 (with patch): One significant improvement with the patch
base v jemalloc 4 (with patch): All in the noise

(Really hope I did this correctly, my first time using perfherder!)

Before I opine on possible future directions, would love if someone can confirm that my attempts to use the tools and summarize what they say is correct!
(In reply to Alex Gaynor from comment #60)
> Summary:
> 
> base v jemalloc 4 (with patch): All in the noise
> (Really hope I did this correctly, my first time using perfherder!)

For this metric it's best to expand the subtests (hover over the platform and the click 'subtests'). There you can see the 10MB regression after gc. 

> Before I opine on possible future directions, would love if someone can
> confirm that my attempts to use the tools and summarize what they say is
> correct!

Yes, you're using them correctly; sometimes you have to dig into the subtests though. Perfherder uses a geometric mean of the subtests for the higher-level comparison, which finds rather large regressions but might miss a regression in just one of the checkpoints (in this case "tabs loaded, after GC" is probably the most important).
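For reference, the high-level comparison number is roughly a geometric mean over the subtest values, something like this (a minimal sketch of the idea, not Perfherder's actual code):

  #include <math.h>
  #include <stddef.h>

  /* Geometric mean of n positive subtest values: exp(mean of logs).
   * A large change in a single subtest gets dampened in the summary number. */
  static double geometric_mean(const double *values, size_t n)
  {
      double log_sum = 0.0;
      for (size_t i = 0; i < n; i++)
          log_sum += log(values[i]);
      return exp(log_sum / (double)n);
  }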
Sorry I missed this.

(In reply to Alex Gaynor from comment #59)
> From the build with jemalloc4, but not this patch [0], I'm seeing that in
> _one_ Web Content process has a high page-cache:
> 
> ├───29.99 MB (22.80%) -- heap-overhead
> │   ├──17.69 MB (13.45%) ── bin-unused
> │   ├───8.13 MB (06.18%) ── page-cache
> │   └───4.17 MB (03.17%) ── bookkeeping
> 
> however, all the other processes (1 main, 3 web content) all have page-cache
> in the 0-2MB range. I'm still digging into how the AWSY tests work; is there
> any reason we'd expect such a significant deviation between the web content
> processes?

The test opens 100 pages in 30 tabs (round robin), closes all but one, navigates to about:blank, and repeats for a total of 3 times. So a couple of possibilities:
  #1 - That process just loaded heavier pages
  #2 - That process didn't get killed between tests (perhaps an e10s-multi regression)
  #3 - Our GC helper is returning before GC runs in the child process

I'll look into 2 & 3.
(In reply to Eric Rahm [:erahm] from comment #62)
> The test opens 100 pages in 30 tabs (round robin), closes all but one,
> navigates to about:blank, and repeats for a total of 3 times. So a couple
> of possibilities:
>   #1 - That process just loaded heavier pages
>   #2 - That process didn't get killed between tests (perhaps an e10s-multi
> regression)
>   #3 - Our GC helper is returning before GC runs in the child process
> 
> I'll look into 2 & 3.

It looks like #2 is probably the explanation. Basically the first tab is always in the first content process; it never gets closed, so it accumulates more cruft between iterations.
And the more live allocated data, the more dirty pages jemalloc4 allows to be kept around. So that makes sense from the perspective of jemalloc4's automatic reclamation logic.

Would we have expected the cycle collector's purge logic to have been triggered here?
(In reply to Alex Gaynor from comment #64)
> And the more live allocated data, the more dirty pages jemalloc4 allows to
> be kept around. So that makes sense from the perspective of jemalloc4's
> automatic reclamation logic.
> 
> Would we have expected the cycle collector's purge logic to have been
> triggered here?

I think the disconnect is that we force a full GC, not a CC. The assumption is that with the 30 second pause CC should have kicked in; metrics indicate this is *usually* the case [1], but I guess it's not guaranteed.

[1] https://mzl.la/2o7xlwg
Interesting, that's a very handy metric!

The question I see now is: are we comfortable with the memory usage post-CC -- which seems to be 0-2MB or so (per process).

Assuming the answer is yes, there's a few options:

a) Run the CC more frequently
b) Trigger the purge on occasions besides the CC -- maybe on a full GC?
c) Make AWSY trigger the CC

(c) only makes sense if you assume that AWSY and "normal" usage diverge in their CC patterns. From the data you linked, it looks like 85% of cycle collections are less than 30 seconds apart.

Does that make sense? (Please let me know if I'm way off base, I'm just diving into Firefox memory management, and I'm layering what I'm learning on top of my experience with other VMs' GCs)
Comment on attachment 8855021 [details]
Bug 1219914 - increase jemalloc's 'lg_dirty_mult' parameter from the default of 3 to 6

https://reviewboard.mozilla.org/r/126942/#review131712

I guess you didn't mean to send this to review. For one, the build/mozconfig.common change is unwarranted; it's something for a try build. And the discussion since comment 54 doesn't seem to have concluded that this is necessary. I'd really rather avoid madvise as much as possible during "normal" operations, and have it more likely triggered by GC/CC/whatever.
Attachment #8855021 - Flags: review?(mh+mozilla)
Per bug 1363992, jemalloc 4 related bugs are now irrelevant.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → WONTFIX
Attachment #8855021 - Attachment is obsolete: true