Closed Bug 1223445 Opened 9 years ago Closed 9 years ago

KeyframeEffectReadOnly objects end up keeping lots of other objects alive too long

Categories

(Core :: DOM: Animation, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla45
Tracking Status
firefox44 + fixed
firefox45 + fixed
b2g-v2.5 --- fixed

People

(Reporter: wlach, Assigned: smaug)

References

Details

(Keywords: memory-leak, Whiteboard: [MemShrink])

Attachments

(2 files)

So I'd noticed that over the past while, Nightly would be pretty janky after a day of usage. I called :mconley over (:vladan also joined) and we pinned it down to the fact that I had multiple treeherder tabs opened consuming over 3G of memory, thus causing the cycle collector to aggressively collect memory and cause periodic pauses to the browser. Here's a profile:

http://people.mozilla.org/~bgirard/cleopatra/#report=ddbcb94237c91cc0ff36766775b6d15f299a9ee6

I know there's a bug in treeherder somewhere (which I'll try and investigate seperately) and there's only so much we can do about badly performing websites, but perhaps we should at least warn about this sort of case somehow, similar to how we warn about other types of things that make the browser perform slowly?

CC'ing various people I'm told would have insight into this (and/or the ability to fix it)
There's a few things here:

- the treeherder web page likely keeps references to DOM elements longer than it should
- could CC somehow do its walks in smaller increments?
- about:performance should be extended to cover GC & CC (but there's no time for this in Perf-team in Q4 or Q1)
- once that is done, the Slow Addon Watcher should be extended to cover web pages (Page Watcher), but again, no time in Q4/Q1
- I don't think BHR captured all the janks
- it would be neat to have about:telemetry smart enough to point out performance problems using the collected histogram data (e.g. in this case CYCLE_COLLECTOR_MAX_PAUSE)
So the STR for this is "keep treeherder open in several tabs"?
CC is incremental (default slice is 5ms) and we do try to optimize out certainly-alive objects from the graph.

Looking the profile now...
Hmm, looks like the profile is missing most of the symbols.
(In reply to Vladan Djeric (:vladan) -- please needinfo! from comment #1)
> - could CC somehow do its walks in smaller increments?

The cycle collector is already mostly incremental. I'll have to try to reproduce this to see what exactly is taking a long time. Unfortunately unlike the GC we don't have detailed profiling for the CC. I guess the cleopatra profile might indicate that, but I find the UI incomprehensible.

The big CC optimization we do is marking objects as being definitely alive. This may be failing here in a way that could be improved. To figure that out, I'd need a CC log (about:memory -> save concise) to see what all is in the log.
Component: JavaScript: GC → XPCOM
Oh, maybe JS: GC is right after all, since the long pause times are GC, not CC.
Component: XPCOM → JavaScript: GC
(maybe. That profile is hard to interpret)
Something is keeping lots of 
"root FragmentOrElement (xhtml) button class='btn  group-btn btn-xs job-group-count btn-dkgray-count filter-shown' (orphan)" elements alive. They are not in any document (they are orphan), and most
of them are the root objects of a cycle of 384 objects.
Ok, this was a bit painful to debug since treeherder is so heavy page that using debug build was hopeless.
anyhow, those html:button elements have 5 references of which cycle collector normally knows about 4:
- JS wrapper owns the native object
- element's parent owns its child node
- elements child owns its parent node
- nsDOMTokenList (classList) owns its element (and element owns its classList)

The missing 5th reference is coming from KeyframeEffectReadOnly.
Either the page is keeping KeyframeEffectReadOnly objects alive too long, or we have a runtime leak.
Since the page doesn't seem to create any JS wrappers for KeyframeEffectReadOnly objects, I'd say this is a
leak somewhere in AnimationManager.
Component: JavaScript: GC → DOM: Animation
AnimationCollection looks very suspicious here.
Summary: Badly performing web pages can cause cycle collector pauses that are hard to diagnose → KeyframeEffectReadOnly objects end up keeping lots of other objects alive too long
So, if I'm reading the code right, elements have a "property" (SetProperty) which points to AnimationCollection, that collection has array of 
RefPtr<dom::Animation> objects which then have strong refs to KeyframeEffectReadOnly objects which have
nsCOMPtr<Element> mTarget. So, nice little cycle there.
It is broken manually when CommonAnimationManager::RemoveAllElementCollections() is called, since
that goes through all the collections and calls Destroy on them.
That happens when null PresShell is passed to nsPresContext::SetShell.
I guess we could just traverse/unlink those properties. a patch coming.
Hmm, no, that isn't enough.
I noticed that at some point also animation timeline seemed to keep objects alive too long.
Looks like DocumentTimeline removes animations from its hashtable only when refresh driver is about to tick. And on background tabs that may not ever happen.
Thanks for looking into this.

With regards to the timeline, current AnimationTimeline never removes animations from its list. The DocumentTimeline subclass, however, will check on each tick which animations it needs to keep and drop the rest. So perhaps if we're not getting ticks I guess that will never happen. (Perhaps NotifyRefreshDriverDestroying could take care of that.)
[Tracking Requested - why for this release]:
Rather bad runtime leak causing a lot higher cc and gc times than normally - in other words causing jank.


(Not yet sure if this is happening also in 43.)
Keywords: mlk
Whiteboard: [MemShrink]
(In reply to Brian Birtles (:birtles, busy 6-17 Nov) from comment #16)
> With regards to the timeline, current AnimationTimeline never removes
> animations from its list. The DocumentTimeline subclass, however, will check
> on each tick which animations it needs to keep and drop the rest. So perhaps
> if we're not getting ticks I guess that will never happen. (Perhaps
> NotifyRefreshDriverDestroying could take care of that.)

It looks like NotifyRefreshDriverDestroying is only called when a presshell is destroyed. But this seems to be about background tabs that have a presshell (and a refresh driver), but aren't getting ticks. Or am I missing something?
(In reply to Timothy Nikkel (:tn) from comment #18)
> (In reply to Brian Birtles (:birtles, busy 6-17 Nov) from comment #16)
> > With regards to the timeline, current AnimationTimeline never removes
> > animations from its list. The DocumentTimeline subclass, however, will check
> > on each tick which animations it needs to keep and drop the rest. So perhaps
> > if we're not getting ticks I guess that will never happen. (Perhaps
> > NotifyRefreshDriverDestroying could take care of that.)
> 
> It looks like NotifyRefreshDriverDestroying is only called when a presshell
> is destroyed. But this seems to be about background tabs that have a
> presshell (and a refresh driver), but aren't getting ticks. Or am I missing
> something?

Sorry, I posted that comment at the same time as Olli so I didn't realise this was about background tabs.

We discussed this on IRC and I think we need to eagerly remove Animations from their AnimationTimeline in Animation::Cancel()--and that probably requires making AnimationTimelines store a linked-list of Animations instead of an array.
That ::DoCancel change still isn't enough, since there is after all the strong ref from timeline to animation, so nothing ends up calling cancel (even though I would assume styling code to call CancelFromStyle).
We should be calling CancelFromStyle when we run UnbindFromTree here:

  https://dxr.mozilla.org/mozilla-central/rev/84a7cf29f4f14c9b359db2f7f19c0abd6a8e178e/dom/base/Element.cpp#1801

Since calling DeleteProperty should trigger AnimationCommon::PropertyDtor.
Attached patch v2Splinter Review
So there is yet another leak. We never remove Animations in background tabs 
from PendingAnimationTracker (unless the page is closed or becomes foreground).
The reason is that when we try to remove animation from tracker,
GetComposedDoc already returns null. So Element::UnbindFromTree needs to remove properties earlier.

The other leak fix is the mTimeline->RemoveAnimation(this); call in Animation::DoCancel()

LinkedList handling is a bit ugly, but I blame our LinkedList API.
In order to still keep a strong reference to the relevant Animation objects, AnimationSet is now 
nsTHashtable<nsRefPtrHashKey<dom::Animation>>


https://treeherder.mozilla.org/#/jobs?repo=try&revision=453433462a2b



I realized we don't need to traverse the properties atm after all, since the remove them anyway
when closing the relevant page or unbinding element.
But it is not clear to me why Element::UnbindFromTree handling is actually right, and why we shouldn't just
rely on cycle collector. But that is about some other bug.
Assignee: nobody → bugs
Attachment #8686839 - Flags: review?(bbirtles)
Comment on attachment 8686839 [details] [diff] [review]
v2

Sorry for the delay on this one (we had a big Firefox DevCon here over the weekend), and thanks for doing this.

>diff --git a/dom/animation/AnimationTimeline.h b/dom/animation/AnimationTimeline.h
>--- a/dom/animation/AnimationTimeline.h
>+++ b/dom/animation/AnimationTimeline.h
>@@ -86,29 +89,30 @@ public:
>   /**
>    * Inform this timeline that |aAnimation| which is or was observing the
>    * timeline, has been updated. This serves as both the means to associate
>    * AND disassociate animations with a timeline. The timeline itself will
>    * determine if it needs to begin, continue or stop tracking this animation.
>    */
>   virtual void NotifyAnimationUpdated(Animation& aAnimation);
> 
>+  void RemoveAnimation(Animation* aAnimation);
>+

Perhaps we should add a comment explaining why this isn't called more often. e.g. "Typically animations that don't currently require time updates (e.g. because they have finished or are paused) are automatically removed from the timeline on each tick. However, we eagerly remove animations that have been cancelled since the next tick might never come if we are in a background tab, but that shouldn't prevent us from freeing up references to these animations."

Actually, I've now noticed we use this method in a few other places so perhaps this comment isn't necessary. I'll leave it up to you.

>   // Animations observing this timeline
>   //
>   // We store them in (a) a hashset for quick lookup, and (b) an array
>   // to maintain a fixed sampling order.
>   //
>   // The array keeps a strong reference to each animation in order
>   // to save some addref/release traffic and because we never dereference
>   // the pointers in the hashset.
>-  typedef nsTHashtable<nsPtrHashKey<dom::Animation>> AnimationSet;
>-  typedef nsTArray<RefPtr<dom::Animation>>         AnimationArray;
>-  AnimationSet   mAnimations;
>-  AnimationArray mAnimationOrder;
>+  typedef nsTHashtable<nsRefPtrHashKey<dom::Animation>> AnimationSet;
>+  AnimationSet mAnimations;
>+  LinkedList<dom::Animation> mAnimationOrder;

This comment needs to be updated to say that it's the hashset that keeps the strong ref. (The original comment seems to be missing some words in the middle saying that we store raw pointers in the array "... to save some addref/release traffic ..." so feel free to completely rewrite that comment!)

Thanks again.
Attachment #8686839 - Flags: review?(bbirtles) → review+
https://hg.mozilla.org/mozilla-central/rev/61455317cc85
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla45
Comment on attachment 8688008 [details] [diff] [review]
comment tweaked

Approval Request Comment
[Feature/regressing bug #]: I don't know yet. The Element::UnbindFromTree part is really old 
[User impact if declined]: runtime leak until background tabs are closed
[Describe test coverage new/current, TreeHerder]: NA, this is quite hard to test
and needs one to create/analyze cycle collector logs
[Risks and why]: low risk, releasing to-be-deleted objects earlier
[String/UUID change made/needed]: NA

I need to check if we have the bug also in beta.
Attachment #8688008 - Flags: approval-mozilla-aurora?
(this could be a regression from bug 1208385, which is nightly/aurora only)
Haven't managed to reproduce on beta.
Comment on attachment 8688008 [details] [diff] [review]
comment tweaked

This code has been in Nightly for a few days so should we be safe to uplift to Aurora44.
Attachment #8688008 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: