Closed Bug 1261143 Opened 8 years ago Closed 8 years ago

Ghost windows on web.telegram.org leading to long CC times

Categories

(Core :: Storage: IndexedDB, defect)

45 Branch
Unspecified
macOS
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla50
Tracking Status
firefox47 --- wontfix
firefox48 --- wontfix
firefox49 + fixed
firefox50 --- fixed

People

(Reporter: wcpan, Assigned: wcpan)

References

(Blocks 1 open bug)

Details

(Whiteboard: btpp-followup-2016-05-06)

Attachments

(4 files, 9 obsolete files)

381.70 KB, text/plain
Details
9.83 KB, text/plain
Details
1.03 KB, text/html
Details
1.01 KB, patch
khuey
: review+
Details | Diff | Splinter Review
Attached file memory-report.json.gz (obsolete) —
If you need concise or verbose CC/GC log I can upload it to somewhere.
Attached file javascript.options.mem.log.txt (obsolete) —
Log in browser console while javascript.options.mem.log=true .
Component: General → JavaScript: GC
Product: Firefox → Core
Attached file memory-report.json.gz (obsolete) —
Attachment #8736831 - Attachment is obsolete: true
Attachment #8736834 - Attachment is obsolete: true
Attached file census-cc.log (obsolete) —
find_roots result:

0x17d1029c0 [DOMEventTargetHelper ]
    --[mScriptOwner]--> 0x1234c6920 [JS Object (Window)]
    --[UnwrapDOMObject(obj)]--> 0x1254d6400 [nsGlobalWindow #37 inner https://web.telegram.org/#/im?p=@YSITD]

    Root 0x17d1029c0 is a ref counted object with 1 unknown edge(s).
    known edges:
       0x17d853190 [IDBObjectStore] --[mTransaction]--> 0x17d1029c0
Attached file xpcom_mem_refcnt.log (obsolete) —
Leak statistics when enabled XPCOM_MEM_REFCNT_LOG
What I do is:

1. open Firefox with a clean profile
2. set memory.ghost_window_timeout_seconds to 1
3. login to web.telegram.org
4. randomly scrolling chats, clicking links on the chats
5. close the tab
6. open about:memory
7. GC & CC
8. measure memory usage

Sometimes it will cause 2 ghost windows, sometimes it doesn't.
Just repeat the steps until the ghost windows appear.
Can not reproduce on Linux (Nightly).
The above logs were generated on Mac OS (Gecko 46).
Can not reproduce on Nightly and Mac OS too.
Maybe this is fixed in m-c?
Summary: Cycle collection takes long time (> 1000ms) → Ghost windows on web.telegram.org leading to long CC times
Blocks: GhostWindows
Component: JavaScript: GC → DOM: IndexedDB
I'll try to find some more time tomorrow to look at this.
(sorry wpan, had to run to a meeting today)

If this is FF46 only, finding the fix-range (not regression range this time) would be great, so that we can push the fixes to FF47 (assuming it leaks too) and to FF46 (leaks are bad, so even release builds should get those fixes asap, if possible, IMO)
Flags: needinfo?(wpan)
OK, it may takes some time though.
Flags: needinfo?(wpan)
Does not leak on Linux with 46, maybe a Mac OS only problem.

For Mac OS, 47 still got leaks, but only one ghost-window each time, not two.
48 does not seems to leak, need to check if it is an e10s-related issue.
Seems not e10s-related.

I guess now I have to bisect then.
ahaa, I don't have OSX with me now.

Andrew, could you perhaps help here?
Looks like telegram needs some phonenumber to register to, but a prepaid number was ok too.
Flags: needinfo?(continuation)
Whiteboard: btpp-followup-2016-05-06
Attached file refcnt_stacks.txt (obsolete) —
Another refcnt allocation log during bisecting.
Attached file balanced.txt (obsolete) —
This log shows a normal, balanced version of attachment 8745914 [details].
This object should be cycle collectable via IDBObjectStore, but I don't know why sometimes it doesn't work.
Attached file leaked_stacks.txt
I guess I've found which object is holding those IDBTransaction objects.
They are hold by CycleCollectedJSRuntime.

For some reasons IDBTransaction will be created while RecursionDepth() is 0,
but CycleCollectedJSRuntime::ProcessMetastableStateQueue never run at aRecursionDepth == 0 until its destruction, so they don't have a chance to leave the queue.

Bug 1195755 also related to this I guess.
I've tried bholley's testcase in that bug, while it does not leak nsGlobalWindow, it still caused shutdown crash.
I'll find another reduced testcase later.
The attached log is the leaked object's refcnt.log.

Do you have any suggestion, Kyle?
Attachment #8745899 - Attachment is obsolete: true
Attachment #8745911 - Attachment is obsolete: true
Attachment #8745913 - Attachment is obsolete: true
Attachment #8745914 - Attachment is obsolete: true
Attachment #8748045 - Attachment is obsolete: true
Attachment #8750164 - Attachment is obsolete: true
Flags: needinfo?(khuey)
BTW, to reproduce this on web.telegram.org, I've found an easier way:

1. open Firefox with a clean profile
2. set memory.ghost_window_timeout_seconds to 1
3. login to web.telegram.org
4. find some groups that have unread messages
5. keep clicking those groups
6. close the tab
7. open about:memory
8. GC & CC
9. measure memory usage
Can you get the stack for where we create the IDBTransaction with RecursionDepth() == 0?
Flags: needinfo?(khuey) → needinfo?(wpan)
Attached file assertion_stack.txt
I disabled the ifndef and let it crashes at MOZ_ASSERT (the code is based on the most recent mozilla-beta).
Flags: needinfo?(wpan)
It looks like wcpan and khuey are investigating this.
Flags: needinfo?(continuation)
Yuck.  We should probably do a microtask checkpoint (i.e. call http://mxr.mozilla.org/mozilla-central/source/xpcom/base/CycleCollectedJSRuntime.cpp#1368 after processing Cocoa events).  Unfortunately I don't know where that is ... probably somewhere in the Mac appshell implementation.

I *think* on other platforms this doesn't happen (i.e. we call into the native event loop from the XPCOM event loop, not the other way around) but we should verify that.
Attached file testcase.html
This is a reduced testcase.

It looks weird because the original site uses jQuery/AngularJS, that has their own event/promise implementation. This is the closest code path that I can come up with.
OS: Unspecified → Mac OS X
I should correct my previous inspection, obviously this will affect on all branches, in non-e10s mode.
In e10s mode, content processes will spin another event loop so they won't cause this problem.

(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #22)
> Yuck.  We should probably do a microtask checkpoint (i.e. call
> http://mxr.mozilla.org/mozilla-central/source/xpcom/base/
> CycleCollectedJSRuntime.cpp#1368 after processing Cocoa events). 
> Unfortunately I don't know where that is ... probably somewhere in the Mac
> appshell implementation.
> 
> I *think* on other platforms this doesn't happen (i.e. we call into the
> native event loop from the XPCOM event loop, not the other way around) but
> we should verify that.

From my inspection, both CycleCollectedJSRuntime::RunInMetastableState and CycleCollectedJSRuntime::ProcessMetastableStateQueue are called from the XPCOM event loop on Linux and Windows.

So we have several options:

1. Call CycleCollectedJSRuntime::AfterProcessTask in somewhere.

I have tried this in several places (in nsAppShell::ProcessGeckoEvents ... etc.), but I always got it crashes.
Maybe I tried wrong places.

2. Tweak the event loop to make it behaves like other platforms. (i.e. always call from the XPCOM event loop)

I personally prefer this option, but I guess this will introduce some side effects, just like option 1.

3. Hack the recursion count

Set it to mRecursionDepth + 1 if it is less then mRecursionDepth.
Seems passed all tests. I think this is easier to back-port to earlier branches.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=e422d5bb09fa
I think we should do #3 for the moment.

If you were waiting on a response from me before proceeding, you should have used needinfo.
OK, I'll remember that.

I've tested this patch in my daily use, didn't see any ghost window again.

Also passed tests on try:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=320f10229392

I'll send a review request after you are back.
Comment on attachment 8762838 [details] [diff] [review]
Bug-1261143-Hack-recursion-depth-in-metastable-state.patch

Review of attachment 8762838 [details] [diff] [review]:
-----------------------------------------------------------------

Could we include a comment describing why we need to do this?  The nested event loop situation on Mac is kind of obscure and it's not immddiately clear what symptom this code is addressing.
Is it possible to increase the recursion depth in the cocoa appshell?  I would prefer that to papering over it here ...
Flags: needinfo?(wpan)
We talked in person, and the answer is no.
Flags: needinfo?(wpan)
Comment on attachment 8762838 [details] [diff] [review]
Bug-1261143-Hack-recursion-depth-in-metastable-state.patch

Review of attachment 8762838 [details] [diff] [review]:
-----------------------------------------------------------------

Please add a comment explaining this.
Attachment #8762838 - Flags: review?(khuey) → review+
Add comment that points to this bug.
Attachment #8762838 - Attachment is obsolete: true
Attachment #8763774 - Flags: review?(khuey)
Pushed by cbook@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/e2b3bf3ebafe
Hack recursion depth in metastable state for Mac OS. r=khuey
Keywords: checkin-needed
https://hg.mozilla.org/mozilla-central/rev/e2b3bf3ebafe
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla50
Assignee: nobody → wpan
Kyle, do you think it would be safe to uplift this?
Flags: needinfo?(khuey)
I would be ok with uplifting it to Aurora.
Flags: needinfo?(khuey)
Wei-Chen, do you mind getting this uplifted to aurora 49?  We want to uplift since it leads to more memory leaks on mac.
Flags: needinfo?(wpan)
Comment on attachment 8763774 [details] [diff] [review]
Bug-1261143-Hack-recursion-depth-in-metastable-state.patch

Approval Request Comment
[Feature/regressing bug #]: bug 1264566
[User impact if declined]: Mac users may suffer from memory leaks on certain sites and lead to bad experience.
[Describe test coverage new/current, TreeHerder]: https://treeherder.mozilla.org/#/jobs?repo=try&revision=320f10229392
[Risks and why]: It does not break IndexedDB or cause crash, I think it's safe to uplift.
[String/UUID change made/needed]: None.
Flags: needinfo?(wpan)
Attachment #8763774 - Flags: approval-mozilla-aurora?
Attachment #8763774 - Flags: approval-mozilla-aurora?
Comment on attachment 8763774 [details] [diff] [review]
Bug-1261143-Hack-recursion-depth-in-metastable-state.patch

Approval Request Comment
[Feature/regressing bug #]: bug 1261143
[User impact if declined]: Mac users may suffer from memory leaks on certain sites and lead to bad experience.
[Describe test coverage new/current, TreeHerder]: https://treeherder.mozilla.org/#/jobs?repo=try&revision=320f10229392
[Risks and why]: It does not break IndexedDB or cause crash, I think it's safe to uplift.
[String/UUID change made/needed]: None.
Attachment #8763774 - Flags: approval-mozilla-aurora?
Revised the bug number.
Tracking 49+ for this ghost windows issue.
Comment on attachment 8763774 [details] [diff] [review]
Bug-1261143-Hack-recursion-depth-in-metastable-state.patch

Regression from 45, let's uplift this memory leak fix.
Attachment #8763774 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
This would be good to verify in aurora, maybe with the test case here and the one in bug 1195755 (or known affected sites)
Flags: needinfo?(cornel.ionce)
I tried reproducing this bug on multiple stations using Fx 48.0a1, build ID 20160411030231 and 20160331030231 but could not do so. Can you please verify the fix?
Flags: needinfo?(cornel.ionce)
I can verify this on 48.0b6.

1. open attachment 8751693 [details]
2. click the button
3. close the tab
4. GC & CC in about:memory
5. measure and search bmoattachments, you should see something marked as detached
6. wait until it becomes ghost-window (somehow memory.ghost_window_timeout_seconds does not seems to work in this case)
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: