Closed Bug 762449 (jemalloc4-by-default) Opened 12 years ago Closed 7 years ago

Enable jemalloc 4 by default

Categories

(Core :: Memory Allocator, defect)

x86_64
Linux
defect
Not set
normal

Tracking

()

RESOLVED WONTFIX
mozilla39
Tracking Status
firefox39 --- disabled
firefox43 --- fixed

People

(Reporter: glandium, Unassigned)

References

(Blocks 2 open bugs)

Details

Attachments

(5 files, 1 obsolete file)

      No description provided.
Blocks: 762451
Depends on: 741720
Although I'm no longer hopeful this will improve our fragmentation situation, it's still relevant to [MemShrink].
Whiteboard: [MemShrink]
Depends on: 763920
At this point it seems like memory consumption won't improve, but it would be nice to be on the upstream version of jemalloc.
Whiteboard: [MemShrink]
I spent some time looking at http://dromaeo.com/?dom-modify for another bug and noticed that it was spending a lot of time spinning on a lock. Today Ehsan, Jeff and I looked at it, and it seems that the new jemalloc should improve the situation by using TLS for the data structures used in at least some allocations.

Is there an easy way to enable it on a build in OS X to check that?
Note that when we initially turn on jemalloc3, we may disable the TLS cache, since it (unsurprisingly) appears to cause a memory-usage regression.
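
As a rough illustration of the trade-off in the two comments above, here is a minimal Python sketch of a block pool with an optional per-thread cache. It is not jemalloc's code: `CachedPool`, `cache_limit` and `churn` are hypothetical names made up for this example. It only shows why a thread-local cache cuts down on shared-lock traffic, and why blocks parked in such a cache count as extra memory held.

```python
import threading


class CachedPool:
    """Toy block pool: a shared free list behind a lock, plus per-thread caches."""

    def __init__(self, cache_limit):
        self.lock = threading.Lock()
        self.shared_free = []           # blocks any thread may reuse
        self.lock_acquisitions = 0      # how often the shared lock was taken
        self.next_id = 0
        self.cache_limit = cache_limit  # 0 disables the thread-local cache
        self.tls = threading.local()

    def _cache(self):
        # Lazily create this thread's private list of cached blocks.
        if not hasattr(self.tls, "blocks"):
            self.tls.blocks = []
        return self.tls.blocks

    def alloc(self):
        cache = self._cache()
        if cache:
            return cache.pop()          # fast path: no shared lock
        with self.lock:
            self.lock_acquisitions += 1
            if self.shared_free:
                return self.shared_free.pop()
            self.next_id += 1
            return self.next_id

    def free(self, block):
        cache = self._cache()
        if len(cache) < self.cache_limit:
            cache.append(block)         # fast path, but the block stays tied to this thread
            return
        with self.lock:
            self.lock_acquisitions += 1
            self.shared_free.append(block)


def churn(pool, rounds):
    # Repeated alloc/free pairs, similar in spirit to a DOM-modification loop.
    for _ in range(rounds):
        pool.free(pool.alloc())


uncached = CachedPool(cache_limit=0)
cached = CachedPool(cache_limit=8)
churn(uncached, 1000)
churn(cached, 1000)
print(uncached.lock_acquisitions, cached.lock_acquisitions)
```

With the cache disabled, every alloc and every free takes the shared lock. With it enabled, only the very first alloc does, at the price of a block that stays reserved for the calling thread.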

> Is there an easy way to enable it on a build in OS X to check that?

According to bug 580408 comment 60, you need to build with MOZ_JEMALLOC to get the new jemalloc.  But I'm not sure it works (or has even been tested on) OSX.
export MOZ_JEMALLOC=1 in your mozconfig.

(In reply to Justin Lebar [:jlebar] from comment #4)
> According to bug 580408 comment 60, you need to build with MOZ_JEMALLOC to
> get the new jemalloc.  But I'm not sure it works (or has even been tested
> on) OSX.

It was tested on all platforms. Only b2g is broken because of a toolchain problem.
Depends on: 799090
Depends on: 799093
Depends on: 801536
Depends on: 815071
No longer blocks: replace-malloc
Alias: jemalloc3-default
Alias: jemalloc3-default → jemalloc3-by-default
Depends on: 1014300
Depends on: 1014308
Depends on: 1107677
Depends on: 1107694
Depends on: 1108045
Depends on: 1110484
Depends on: 1110505
Depends on: 1110514
Completed triage of the last 3 years of commits [1], added 3 more blockers that can be resolved on mozilla's side. 3 changesets still need to be triaged by other folks.

[1] https://docs.google.com/document/d/1YkJaXVlO4uDHKE47Iel5hT5uRDHao2dku8BCSwGbtRs/edit?usp=sharing
So here is what I think we should do, considering the holidays and the timing wrt next uplift:
- Land bug 1107694 when there is a proper fix for it. I found what's wrong there, I just don't know what the right value for the fix is.
- Switch to jemalloc 3 by default on Jan 12 or 13, after the uplift.
- Resolve the other blockers to this bug: those that can be fixed before Jan 12 can be fixed before then, but at that point I don't think it's worth technically blocking on them, as long as we fix them in the following 6 weeks, and if we don't, we can still make jemalloc 3 not ride the train. FTR, I have a bunch of WIP patches applied on my git clone for the bugs I'm assigned to; I just won't have them ready before next year because of holidays :)
FYI I'm keeping an eye on this from the FxOS side mostly to ensure we don't hit memory usage regressions because - as usual - we're dealing with devices with a very tight memory budget.
Blocks: 1005844
Assignee: nobody → mh+mozilla
Attachment #8547838 - Flags: review?(n.nethercote)
In fact, we rely on the shell variable being set too, so set it.
Attachment #8547838 - Attachment is obsolete: true
Attachment #8547838 - Flags: review?(n.nethercote)
Attachment #8547840 - Flags: review?(n.nethercote)
Attachment #8547840 - Flags: review?(n.nethercote) → review+
And backed out:
https://hg.mozilla.org/integration/mozilla-inbound/rev/ffafa737cb7c

DMD test failures, presumably because of size classes changes:
https://treeherder.mozilla.org/ui/logviewer.html#?job_id=5376229&repo=mozilla-inbound

More critically, there's an infinite loop involving a0alloc in both mac and windows:
https://treeherder.mozilla.org/ui/logviewer.html#?job_id=5374694&repo=mozilla-inbound
https://treeherder.mozilla.org/ui/logviewer.html#?job_id=5376065&repo=mozilla-inbound

And there's the b2g emulator issue, but it might be related to the above, I haven't attached a debugger yet.
(In reply to Mike Hommey [:glandium] from comment #13)
> And there's the b2g emulator issue, but it might be related to the above, I
> haven't attached a debugger yet.

So with a debugger attached, it looks like it's stuck in the libc, but I don't have symbols, and downloading/building a b2g emulator build is going to take a very long time. Eric, would you mind looking at this? I've reproduced with the emulator build I got from automation with "LD_PRELOAD=/system/b2g/libmozglue.so cat"
I'll take a look at this in the morning.
Flags: needinfo?(erahm)
This started on your push too. AFAICT, it was linux64 opt only.
https://treeherder.mozilla.org/ui/logviewer.html#?job_id=5375538&repo=mozilla-inbound
Depends on: 1120798
(In reply to Mike Hommey [:glandium] from comment #13)
> More critically, there's an infinite loop involving a0alloc in both mac and
> windows:
> https://treeherder.mozilla.org/ui/logviewer.
> html#?job_id=5374694&repo=mozilla-inbound
> https://treeherder.mozilla.org/ui/logviewer.
> html#?job_id=5376065&repo=mozilla-inbound

Investigated and filed https://github.com/jemalloc/jemalloc/issues/184
(In reply to Ryan VanderMeulen [:RyanVM UTC-5] from comment #16)
> This started on your push too. AFAICT, it was linux64 opt only.
> https://treeherder.mozilla.org/ui/logviewer.
> html#?job_id=5375538&repo=mozilla-inbound

I identified this one. It's either a bug in the pkcs11 loader or the pkcs11testmodule, or both. We've been really lucky it didn't happen before. I will file a bug about it tomorrow.
(In reply to Mark Finkle (:mfinkle) from comment #19)
> Looks like this patch regressed startup by ~200ms on Android too. Here's a
> quick screenshot:
> http://cl.ly/image/0G1Y3s0L422c
> 
> Link to the tests:
> http://phonedash.mozilla.org/#/org.mozilla.fennec/totalthrobber/local-blank/
> norejected/2015-01-12/2015-01-13/notcached/noerrorbars/standarderror

How does one test that on try?
Flags: needinfo?(mark.finkle)
(In reply to Mike Hommey [:glandium] from comment #14)
> (In reply to Mike Hommey [:glandium] from comment #13)
> > And there's the b2g emulator issue, but it might be related to the above, I
> > haven't attached a debugger yet.
> 
> So with a debugger attached, it looks like it's stuck in the libc, but I
> don't have symbols, and downloading/building a b2g emulator build is going
> to take a very long time. Eric, would you mind looking at this? I've
> reproduced with the emulator build I got from automation with
> "LD_PRELOAD=/system/b2g/libmozglue.so cat"

This may well be related to https://github.com/jemalloc/jemalloc/issues/184 because we don't have native tls on android and gonk, so we're effectively in the same kind of infinite-loopy setup as windows and mac.
(In reply to Mike Hommey [:glandium] from comment #20)
> (In reply to Mark Finkle (:mfinkle) from comment #19)
> > Looks like this patch regressed startup by ~200ms on Android too. Here's a
> > quick screenshot:
> > http://cl.ly/image/0G1Y3s0L422c
> > 
> > Link to the tests:
> > http://phonedash.mozilla.org/#/org.mozilla.fennec/totalthrobber/local-blank/
> > norejected/2015-01-12/2015-01-13/notcached/noerrorbars/standarderror
> 
> How does one test that on try?

One does not. Bob Clary is working on getting Try working for PhoneDash, but it's not completed. In the meantime, we do one of two things:
1. Send test patches to Bob and he runs them in the PhoneDash framework
2. We try to use a local script that launches Fennec via ADB and watches for Throbber Start and Throbber Stop messages.

I forget where the script for #2 lives. Bob might have other alternatives.
Flags: needinfo?(mark.finkle) → needinfo?(bob)
Depends on: 1120937
glandium, I can walk you through the set up of autophone if you have a slow, rooted android phone available or I can test your patches for you if you like.
Flags: needinfo?(bob)
(In reply to Mike Hommey [:glandium] from comment #14)
> (In reply to Mike Hommey [:glandium] from comment #13)
> > And there's the b2g emulator issue, but it might be related to the above, I
> > haven't attached a debugger yet.
> 
> So with a debugger attached, it looks like it's stuck in the libc, but I
> don't have symbols, and downloading/building a b2g emulator build is going
> to take a very long time. Eric, would you mind looking at this? I've
> reproduced with the emulator build I got from automation with
> "LD_PRELOAD=/system/b2g/libmozglue.so cat"

It appears |getprop| is deadlocked when initializing jemalloc. Our |__wrap_pthread_key_create| [1] function uses a std::map [2] which then tries to allocate memory resulting in a deadlock.

[1] https://hg.mozilla.org/mozilla-central/annotate/67257a3edeb5/mozglue/build/Nuwa.cpp#l724
[2] https://hg.mozilla.org/mozilla-central/annotate/67257a3edeb5/mozglue/build/Nuwa.cpp#l730

Full stack:
> #0  __futex_syscall3 () at bionic/libc/arch-arm/bionic/atomics_arm.S:183
> #1  0x40087264 in _normal_lock (mutex=<optimized out>) at bionic/libc/bionic/pthread.c:951
> #2  pthread_mutex_lock (mutex=0x4006f5d4) at bionic/libc/bionic/pthread.c:1041
> #3  0x400395b4 in malloc_init_hard () at /home/erahm/dev/mozilla-central/memory/jemalloc/src/include/jemalloc/internal/mutex.h:77
> #4  0x4003a098 in je_malloc () at /home/erahm/dev/mozilla-central/memory/jemalloc/src/src/jemalloc.c:249
> #5  0x4002a992 in std::priv::_Rb_tree<int, std::less<int>, std::pair<int const, void (*)(void*)>, std::priv::_Select1st<std::pair<int const, void (*)(void*)> >, std::priv::_MapTraitsT<std::pair<int const, void (*)(void*)> >, std::allocator<std::pair<int const, void (*)(void*)> > >::_M_create_node ()
>    at /home/erahm/dev/mozilla-central/build/stlport/stlport/stl/_new.h:134
> #6  0x4002c60a in std::priv::_Rb_tree<int, std::less<int>, std::pair<int const, void (*)(void*)>, std::priv::_Select1st<std::pair<int const, void (*)(void*)> >, std::priv::_MapTraitsT<std::pair<int const, void (*)(void*)> >, std::allocator<std::pair<int const, void (*)(void*)> > >::_M_insert(std::priv::_Rb_tree_node_base*, std::pair<int const, void (*)(void*)> const&, std::priv::_Rb_tree_node_base*, std::priv::_Rb_tree_node_base*) ()
>    at /home/erahm/dev/mozilla-central/build/stlport/stlport/stl/_tree.c:359
> #7  0x4002c6d0 in __wrap_pthread_key_create () at /home/erahm/dev/mozilla-central/build/stlport/stlport/stl/_tree.c:422
> #8  0x40043204 in je_malloc_tsd_boot0 () at /home/erahm/dev/mozilla-central/memory/jemalloc/src/include/jemalloc/internal/tsd.h:605
> #9  0x40039600 in malloc_init_hard () at /home/erahm/dev/mozilla-central/memory/jemalloc/src/src/jemalloc.c:1123
> #10 0x4003f0d8 in jemalloc_constructor () at /home/erahm/dev/mozilla-central/memory/jemalloc/src/src/jemalloc.c:249
> #11 0xb0001156 in call_array (ctor=0x4006d9c8, count=0, reverse=<optimized out>) at bionic/linker/linker.c:1589
> #12 0xb0001dc6 in call_constructors (si=<optimized out>) at bionic/linker/linker.c:1619
> #13 __dl_$t () at bionic/linker/linker.c:2013
> #14 0xb00028a4 in init_library (si=<optimized out>) at bionic/linker/linker.c:1169
> #15 find_library (name=<optimized out>) at bionic/linker/linker.c:1212
> #16 0xb0001b90 in __dl_$t () at bionic/linker/linker.c:1917
> #17 0xb0002108 in __linker_init (elfdata=<optimized out>) at bionic/linker/linker.c:2200
> #18 0xb000100c in __dl__start () at bionic/linker/arch/arm/begin.S:37
> #19 0xb000100c in __dl__start () at bionic/linker/arch/arm/begin.S:37
> Backtrace stopped: previous frame identical to this frame (corrupt stack?)
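
The re-entrancy visible in the stack above (allocator init calls the key-creation wrapper, which allocates, which re-enters allocator init while the init mutex is held) can be modelled with a small Python sketch. This is an illustration only, not the Nuwa or jemalloc code: `ToyAllocator`, `allocating_key_create`, `static_key_create` and `STATIC_SLOTS` are hypothetical names, and the preallocated-storage variant is just one conceivable way to avoid the cycle, not necessarily the fix that was adopted.

```python
import threading


class ToyAllocator:
    """Toy model of an allocator that initializes lazily under a plain mutex."""

    def __init__(self, key_create):
        self.init_lock = threading.Lock()  # non-recursive, like a normal pthread mutex
        self.initialized = False
        self.key_create = key_create       # stands in for the pthread_key_create wrapper

    def init_hard(self):
        # A real allocator would block forever here on re-entry. The sketch is
        # single-threaded, so a failed non-blocking acquire can only mean
        # re-entry, and it is reported instead of hanging.
        if not self.init_lock.acquire(blocking=False):
            raise RuntimeError("init re-entered while the init lock is held")
        try:
            if not self.initialized:
                self.key_create(self)      # thread-specific-data setup
                self.initialized = True
        finally:
            self.init_lock.release()

    def malloc(self, size):
        if not self.initialized:
            self.init_hard()
        return bytearray(size)


def allocating_key_create(allocator):
    # Hypothetical wrapper that records the key in a heap-allocating container,
    # so it calls back into malloc before initialization has finished.
    allocator.malloc(16)


STATIC_SLOTS = [None] * 8                  # storage reserved ahead of time


def static_key_create(allocator):
    # Hypothetical alternative: record the key in preallocated storage,
    # which never calls malloc.
    STATIC_SLOTS[0] = "key"


def would_deadlock(key_create):
    try:
        ToyAllocator(key_create).malloc(1)
    except RuntimeError:
        return True
    return False


print(would_deadlock(allocating_key_create))
print(would_deadlock(static_key_create))
```

The first call reports the re-entry that shows up as a hang in the real trace; the second completes because the wrapper no longer allocates during initialization.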
Flags: needinfo?(erahm)
Depends on: 1121269
Depends on: 1121314
Bob, would you mind checking how startup goes for the builds from this try? Thanks.
https://treeherder.mozilla.org/#/jobs?repo=try&author=mh%40glandium.org
Flags: needinfo?(bob)
Ok, I'm testing api-9 and api-11 from http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/mh@glandium.org-471936506ea6/ compared to the latest mozilla-inbound. It will take a while to download the builds, but I'll let you know as soon as I get the results.
Attached file jemalloc.org
comparison of http://hg.mozilla.org/try/rev/471936506ea6 to http://hg.mozilla.org/integration/mozilla-inbound/rev/ad2042b4c668

twitter and blank start times are comparable between the two builds but the webappstartup test start time regressed. Strangely the galaxy s3 Android 4.0 regressed more than the nexus one Android 2.3 on the webappstartup start time.

twitter, blank and webappstartup stop times all regressed.
Flags: needinfo?(bob)
So, I got autophone working with Bob's help, and got somewhat plausible results despite the huge stddev. Then, since it was all slow, I factory reset the phone and cleaned up its SD card. The phone ended up much faster (for instance, I no longer need autophone config adjustments because of slow reboots), but now the results are completely unusable: stddev is still big, and the jemalloc3 builds end up with better results than the mozjemalloc builds... which makes it hard to investigate what's wrong.
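
To show why a large stddev makes such comparisons unusable, here is a small Python sketch that checks whether the difference between two sets of timings exceeds their combined standard error. The helper names and the threshold `k` are assumptions chosen for illustration, and the sample values in the usage lines are made up; they are not measurements from this bug or from autophone.

```python
import math
import statistics


def summarize(samples):
    """Return (mean, standard error of the mean) for a list of timings."""
    mean = statistics.mean(samples)
    stderr = statistics.stdev(samples) / math.sqrt(len(samples))
    return mean, stderr


def distinguishable(a, b, k=2.0):
    """True if the means differ by more than k combined standard errors."""
    mean_a, se_a = summarize(a)
    mean_b, se_b = summarize(b)
    combined = math.sqrt(se_a ** 2 + se_b ** 2)
    return abs(mean_a - mean_b) > k * combined


# Made-up timings in ms, for illustration only.
noisy_a = [1000, 1400, 800, 1300]
noisy_b = [1100, 900, 1500, 1200]
tight_a = [1000, 1010, 990, 1000]
tight_b = [1200, 1210, 1190, 1200]

print(distinguishable(noisy_a, noisy_b))
print(distinguishable(tight_a, tight_b))
```

With the noisy samples the spread swamps the difference between the means, so neither build can be called faster. With the tight samples the same check separates them cleanly.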

That being said, I have a theory, so I'd appreciate a test of those two try builds:
http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/mh@glandium.org-36dd4fb91b48/try-android-api-11/
http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/mh@glandium.org-c00ce6c720e1/try-android-api-11/
Flags: needinfo?(bob)
glandium, it would help to get both api-9 and api-11 builds so I can test using both my nexus one and my gs3.

I posted these to phonedash-dev so you can see the graphs and ran your two try builds (the first didn't have any builds) along with the latest mozilla-inbound for comparison.

http://phonedash-dev.allizom.org/#/org.mozilla.fennec/throbberstop/local-blank/norejected/2015-01-15/2015-01-15/notcached/errorbars/standarderror

http://phonedash-dev.allizom.org/#/org.mozilla.fennec/throbberstop/local-twitter/norejected/2015-01-15/2015-01-15/notcached/errorbars/standarderror

http://phonedash-dev.allizom.org/#/org.mozilla.fennec/throbberstop/webappstartup/norejected/2015-01-15/2015-01-15/notcached/errorbars/standarderror

http://phonedash-dev.allizom.org/#/org.mozilla.fennec/throbberstop/webappstartup/norejected/2015-01-15/2015-01-15/notcached/errorbars/standarderror

As you can see, throbber stop regressed with both try builds on s1s2 blank and twitter and both throbber start and stop regressed on webappstartup.
Flags: needinfo?(bob)
FYI I was looking at the browsermark 2.1 knockout benchmark score. The difference between v8 and spidermonkey is the time we spend in the first loop of "compareSmallArrayToBigArray", which is currently mostly MinorGC ("freeHugeSlots"), which only does js_free.

I created a js shell benchmark out of it in bug 1118938. The numbers should somewhat relate to the full browsermark, but the scores reported here are from the shell benchmark.

On trunk we score 2900ms, with the first loop taking 1000ms. (On Linux the loop takes 500ms, due to a faster js_free.)
I was told by ehoogeveen to test jemalloc3, since it could potentially improve scores. Bad luck here:
the total score becomes 3398ms, while the loop now takes 1422ms!
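
For reference, the arithmetic behind those numbers can be worked out with a few lines of Python. The timings are the ones quoted in this comment; the variable and function names are just for this sketch.

```python
# Timings quoted above, in milliseconds.
trunk_total_ms, trunk_loop_ms = 2900, 1000
je3_total_ms, je3_loop_ms = 3398, 1422


def pct_increase(before, after):
    """Relative increase from before to after, in percent."""
    return (after - before) / before * 100


total_delta = je3_total_ms - trunk_total_ms     # extra time overall
loop_delta = je3_loop_ms - trunk_loop_ms        # extra time in the first loop
loop_share = loop_delta / total_delta * 100     # share of the slowdown due to the loop

print(f"total: +{total_delta}ms ({pct_increase(trunk_total_ms, je3_total_ms):.1f}%)")
print(f"loop:  +{loop_delta}ms ({pct_increase(trunk_loop_ms, je3_loop_ms):.1f}%)")
print(f"loop accounts for {loop_share:.1f}% of the slowdown")
```

The total is about 17% slower and the loop about 42% slower, and roughly 85% of the added time comes from the loop, which is the part dominated by js_free.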
Bob, can you get numbers for these builds?
http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/mh@glandium.org-aef35ebef737/
Flags: needinfo?(bob)
You can see the various graphs at http://phonedash-dev.allizom.org/#/org.mozilla.fennec/throbberstop/local-blank/norejected/2015-01-13/2015-01-15/notcached/errorbars/standarderror

You can select the different tests local blank, local twitter, webappstartup and look at the start and stop times. The regression pattern remains pretty much the same.
Flags: needinfo?(bob)
(In reply to Mike Hommey [:glandium] from comment #34)
> http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/mh@glandium.org-
> 31a67cc812d5/

This one is failing to get any measurements of the Throbber times and is generating literally millions of E/GeckoAlloc( 3342): overflow messages in logcat. It is taking quite a while to complete. It doesn't look like the logcat contains anything else of use.
Flags: needinfo?(bob)
(In reply to Mike Hommey [:glandium] from comment #34)

> http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/mh@glandium.org-
> 592a30a8a738/

This one also has the E/GeckoAlloc( 1566): overflow issue. It is still early in the run, but it doesn't look like this will get any measurements either. I'll let it continue though just in case.
Attached file gecko-alloc.zip
GeckoAlloc non-overflow messages only
When they're up, please test those builds:
 http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/mh@glandium.org-d135ff58d0e7
 http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/mh@glandium.org-dee0ae7edb6b

I'm only interested in the GeckoAlloc logs this time as well.
Flags: needinfo?(bob)
(In reply to Mike Hommey [:glandium] from comment #38)
> When they're up, please test those builds:
>  http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/mh@glandium.org-
> d135ff58d0e7
>  http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/mh@glandium.org-
> dee0ae7edb6b
> 
> I'm only interested in the GeckoAlloc logs this time as well.

If d135ff58d0e7 doesn't get throbber times, can you also try this one:
http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/mh@glandium.org-3b57b77accb7/
provided info via irc
Flags: needinfo?(bob)
I think I finally got builds that can get the info I want. Please test those:
http://ftp.mozilla.org/pub/mozilla.org/mobile/try-builds/mh@glandium.org-042225505427/
http://ftp.mozilla.org/pub/mozilla.org/mobile/try-builds/mh@glandium.org-952452fa15ad/

For those, I'd be interested in their score and their GeckoAlloc output. Thanks.
Flags: needinfo?(bob)
done
Flags: needinfo?(bob)
I did two more try builds, with the autophone trigger, but it didn't trigger anything for the nexus one and the gs3 :(
Could you test these?
http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/mh@glandium.org-64566756bf92
http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/mh@glandium.org-9d46e7b57308
Flags: needinfo?(bob)
The nexus-one-3 and samsgung-gs3-3 devices are my local devices and won't have picked up your try builds unless I had set them up for it at the time and had my local instance of autophone running when you submitted them.

Several of your try builds have completed testing and are available at http://phonedash.mozilla.org/#/org.mozilla.fennec/throbberstop/local-twitter/norejected/2015-01-21/2015-01-21/cached/noerrorbars/standarderror/try

You still have outstanding jobs for nexus-s-3, nexus-5-kot49h-1, nexus-s-4, nexus-5-kot49h-3 and nexus-s-5 for

http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/mh@glandium.org-4b002c559b19/
http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/mh@glandium.org-64566756bf92/
http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/mh@glandium.org-9d46e7b57308/
http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/mh@glandium.org-c789abcd50e7/

To see only your builds, click on the legend items on the left for the non try builds to hide their series. 

To get the logs you will need to visit the staging treeherder instance for now, click on the relevant test and look in the job details panel for links to the logcat. Note the tombstones signifying crashes of some type. See

https://treeherder.allizom.org/ui/#/jobs?repo=try&revision=64566756bf92
https://treeherder.allizom.org/ui/#/jobs?repo=try&revision=9d46e7b57308
https://treeherder.allizom.org/ui/#/jobs?repo=try&revision=4b002c559b19
https://treeherder.allizom.org/ui/#/jobs?repo=try&revision=c789abcd50e7

The workers are a bit behind at the moment. Several disconnected and I will need to reboot the system.
Flags: needinfo?(bob)
Autophone is now available in
http://trychooser.pub.build.mozilla.org/ to help with your try
syntax. Please limit your tests to the ones you need and don't
use the mochitests unless you really need them. Also remember we
don't have many devices and they are usually pretty busy so
please don't DOS them. I'll work on the documentation in the next
few days but this should work for most of your cases:

try: -b o -p android-api-9,android-api-11 -u autophone-s1s2 -t none

Until Autophone begins reporting to production Treeherder, you
can find logs etc on the staging instance
https://treeherder.allizom.org.

Let me know if you have issues.
(In reply to Bob Clary [:bc:] from comment #45)
> Autophone is now available in
> http://trychooser.pub.build.mozilla.org/ to help with your try
> syntax. Please limit your tests to the ones you need and don't
> use the mochitests unless you really need them. Also remember we
> don't have many devices and they are usually pretty busy so
> please don't DOS them. I'll work on the documentation in the next
> few days but this should work for most of your cases:
> 
> try: -b o -p android-api-9,android-api-11 -u autophone-s1s2 -t none
> 
> Until Autophone begins reporting to production Treeherder, you
> can find logs etc on the staging instance
> https://treeherder.allizom.org.
> 
> Let me know if you have issues.

Unfortunately, the results of my recent try builds don't show up properly on treeherder.allizom.org, so I can't get to their logcat :(
https://treeherder.allizom.org/#/jobs?repo=try&revision=790f61d4a2d8
https://treeherder.allizom.org/#/jobs?repo=try&revision=ceea827908dd
Flags: needinfo?(bob)
glandium: Sorry about the problems. Your first build appears on treeherder.allizom.org but the second doesn't.

Staging treeherder had an issue with netflows when they enabled the new db setup. I've gotten permission to submit to treeherder production with the jobs automatically hidden. I'd held off until I could find a clean switch over time but with the recent load, I don't think I'll ever find a perfect time. So, I've switched to reporting to treeherder production.

All new builds will report to treeherder.mozilla.org and any existing builds in the job queue will begin reporting there as well. We should have a much improved system availability and stability on production going forward.
Flags: needinfo?(bob)
Depends on: 1134123
Comment on attachment 8570335 [details] [diff] [review]
Make jemalloc's opt.lg_dirty_mult work as documented

Review of attachment 8570335 [details] [diff] [review]:
-----------------------------------------------------------------

rs=me
Attachment #8570335 - Flags: review?(n.nethercote) → review+
memory/build has fatal warnings, and this warning hits win64 only. I don't know how I didn't get this error before...
Attachment #8570394 - Flags: review?(n.nethercote)
Attachment #8570394 - Flags: review?(n.nethercote) → review+
Depends on: 1138705
Depends on: 1139036
Depends on: 1139905
No longer depends on: 1139905
Depends on: 1141079
Depends on: 1141761
Depends on: 1141660
Actually, let's keep this bug open to track definitely enabling jemalloc3 (as in, let it ride the trains)
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Depends on: 1142403
Depends on: 1142412
Depends on: 1142414
this has a lot of improvements and 2 private byte regressions on linux (23 + 64):
http://alertmanager.allizom.org:8080/alerts.html?rev=0a11b73c77b7&showAll=1&testIndex=0&platIndex=0
as a note, these correspond to the improvements when this originally landed:
http://alertmanager.allizom.org:8080/alerts.html?rev=a1a89ff4ee31&showAll=1&testIndex=0&platIndex=0
Mike, any estimate when this can be re-enabled? Bug 1005844 is waiting on it.
Flags: needinfo?(mh+mozilla)
Depends on: 1201453
Depends on: 1201738
Flags: needinfo?(mh+mozilla)
Blocks: 1201802
https://hg.mozilla.org/mozilla-central/rev/8b380feae2ae
Status: REOPENED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Alias: jemalloc3-by-default → jemalloc4-by-default
Summary: Enable jemalloc 3 by default → Enable jemalloc 4 by default
Depends on: 1205289
No longer depends on: 1205289
On my local machines (Ubuntu 15.04 and Ubuntu 14.10), I get a start-up crash with my custom build. The call stack is as follows:

#0  0x0000000000439002 in run_quantize (size=0) at /home/morris/mozilla/gecko-dev/memory/jemalloc/src/src/arena.c:96
#1  0x00000000004396ef in run_quantize_first (size=4096) at /home/morris/mozilla/gecko-dev/memory/jemalloc/src/src/arena.c:155
#2  0x0000000000446857 in arena_run_first_best_fit (arena=0x7ffff6a00180, size=4096) at /home/morris/mozilla/gecko-dev/memory/jemalloc/src/src/arena.c:1073
#3  0x0000000000446fbb in arena_run_alloc_small_helper (arena=0x7ffff6a00180, size=4096, binind=20) at /home/morris/mozilla/gecko-dev/memory/jemalloc/src/src/arena.c:1129
#4  0x0000000000447101 in arena_run_alloc_small (arena=0x7ffff6a00180, size=4096, binind=20) at /home/morris/mozilla/gecko-dev/memory/jemalloc/src/src/arena.c:1148
#5  0x0000000000454b2e in arena_bin_nonfull_run_get (arena=0x7ffff6a00180, bin=0x7ffff6a01750) at /home/morris/mozilla/gecko-dev/memory/jemalloc/src/src/arena.c:1898
#6  0x0000000000454c58 in arena_bin_malloc_hard (arena=0x7ffff6a00180, bin=0x7ffff6a01750) at /home/morris/mozilla/gecko-dev/memory/jemalloc/src/src/arena.c:1940
#7  0x00000000004558c4 in je_arena_malloc_small (arena=0x7ffff6a00180, size=1024, zero=true) at /home/morris/mozilla/gecko-dev/memory/jemalloc/src/src/arena.c:2153
#8  0x00000000004f8234 in je_arena_malloc (tcache=0x0, zero=true, size=1024, arena=0x7ffff6a00180, tsd=0x7ffff7fe66b0)
    at /home/morris/mozilla/gecko-dev/memory/jemalloc/src/include/jemalloc/internal/arena.h:1145
#9  je_iallocztm (arena=0x0, is_metadata=false, tcache=0x0, zero=true, size=1024, tsd=0x7ffff7fe66b0) at src/include/jemalloc/internal/jemalloc_internal.h:887
#10 je_icalloc (size=1024, tsd=0x7ffff7fe66b0) at src/include/jemalloc/internal/jemalloc_internal.h:920
#11 je_calloc (num=1, size=1024) at /home/morris/mozilla/gecko-dev/memory/jemalloc/src/src/jemalloc.c:1663
#12 0x0000000000423b4b in calloc (num=1, size=1024) at /home/morris/mozilla/gecko-dev/memory/build/replace_malloc.c:181
#13 0x00007ffff65c8890 in PR_Calloc (nelem=1, elsize=1024) at /home/morris/mozilla/gecko-dev/nsprpub/pr/src/malloc/prmem.c:443
#14 0x00007ffff65c69ce in PR_SetThreadPrivate (index=2, priv=0x7ffff6937f98) at /home/morris/mozilla/gecko-dev/nsprpub/pr/src/threads/prtpd.c:161
#15 0x00007fffe4008c2d in mozilla::BlockingResourceBase::ResourceChainAppend (this=0x7ffff6937f98, aPrev=0x0) at ../../dist/include/mozilla/BlockingResourceBase.h:181
#16 0x00007fffe400381d in mozilla::BlockingResourceBase::Acquire (this=0x7ffff6937f98) at /home/morris/mozilla/gecko-dev/xpcom/glue/BlockingResourceBase.cpp:322
#17 0x00007fffe40039f8 in mozilla::OffTheBooksMutex::Lock (this=0x7ffff6937f98) at /home/morris/mozilla/gecko-dev/xpcom/glue/BlockingResourceBase.cpp:383
#18 0x00007fffe3e92f02 in mozilla::Monitor::Lock (this=0x7ffff6937f98) at ../../dist/include/mozilla/Monitor.h:35
#19 0x00007fffe3e92f62 in mozilla::MonitorAutoLock::MonitorAutoLock (this=0x7ffff7fe5e40, aMonitor=...) at ../../dist/include/mozilla/Monitor.h:78
#20 0x00007fffe4080b59 in mozilla::net::ClosingService::ThreadFunc (this=0x7ffff6937f80) at /home/morris/mozilla/gecko-dev/netwerk/base/ClosingService.cpp:206
#21 0x00007fffe4097f2e in mozilla::net::ClosingService::ThreadFunc (aClosure=0x7ffff6937f80) at /home/morris/mozilla/gecko-dev/netwerk/base/ClosingService.h:52
#22 0x00007ffff65e557e in _pt_root (arg=0x7ffff6855fc0) at /home/morris/mozilla/gecko-dev/nsprpub/pr/src/pthreads/ptthread.c:212
#23 0x00007ffff7bc26aa in start_thread (arg=0x7ffff7fe6700) at pthread_create.c:333
#24 0x00007ffff6ec6eed in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109


Reverting to jemalloc3 makes everything fine. But on my Ubuntu VM (host OS is Mac), jemalloc4 works without problems.
This likely is bug 1205016
(In reply to Carsten Book [:Tomcat] from comment #59)
> https://hg.mozilla.org/mozilla-central/rev/8b380feae2ae

Why is this bug FIXED when the change was backed out in bug 1205249 and never came back?
Flags: needinfo?(mh+mozilla)
Status: RESOLVED → REOPENED
Flags: needinfo?(mh+mozilla)
Resolution: FIXED → ---
No longer blocks: 1291356
Mike, what's needed to move this forward? Maybe I can find someone to help on this.
Flags: needinfo?(mh+mozilla)
Depends on: 1277704
(In reply to Kan-Ru Chen [:kanru] (UTC+8) from comment #63)
> Mike, what's needed to move this forward? Maybe I can find someone to help
> on this.

First would be bug 1277704, if it doesn't break things.

Then there are at least two things:
- Figure out why 1219914 happened (iow, why bug 1203840 caused it, which could very well be attributed to how AWSY is measuring things)
- Test the reality of the talos regressions (cf. how the msvc 2015 regressions were found not to have an actual impact contrary to what talos said)
Flags: needinfo?(mh+mozilla)
Depends on: 1315285
No longer depends on: 1141761
Depends on: 1322027
Depends on: 1343432
Depends on: 1343441
Depends on: 1219914
Depends on: 1353752
No longer depends on: 1343441
Per bug 1363992, jemalloc 4 related bugs are now irrelevant.
Assignee: mh+mozilla → nobody
Status: REOPENED → RESOLVED
Closed: 9 years ago → 7 years ago
Resolution: --- → WONTFIX