Closed Bug 694254 Opened 13 years ago Closed 8 years ago

Intermittent leak of two domwindows opened by test_focus_browserui.xul in mochitest-a11y - 1549783 bytes (AsyncStatement, AtomImpl, BackstagePass, CalculateFrecencyFunction, CallbackObject, ...)

Categories

(Core :: Disability Access APIs, defect, P1)

x86
Windows 7
defect

Tracking

()

RESOLVED WORKSFORME

People

(Reporter: philor, Assigned: surkov)

References

(Blocks 2 open bugs)

Details

(Keywords: helpwanted, intermittent-failure, memory-leak, Whiteboard: [leave open][test disabled on Win7/Win8])

Attachments

(4 files)

https://tbpl.mozilla.org/php/getParsedLog.php?id=6825370&tree=Mozilla-Inbound
Rev3 WINNT 6.1 mozilla-inbound debug test mochitest-other on 2011-10-12 21:51:47 PDT for push 26fff9d82083

TEST-UNEXPECTED-FAIL | automationutils.processLeakLog() | leaked 1074643 bytes during test execution
TEST-UNEXPECTED-FAIL | automationutils.processLeakLog() | leaked 1385 instances of AtomImpl with size 20 bytes each (27700 bytes total)
TEST-UNEXPECTED-FAIL | automationutils.processLeakLog() | leaked 1 instance of BackstagePass with size 24 bytes
TEST-UNEXPECTED-FAIL | automationutils.processLeakLog() | leaked 1 instance of DR_State with size 32 bytes
TEST-UNEXPECTED-FAIL | automationutils.processLeakLog() | leaked 1 instance of DocumentRule with size 36 bytes
TEST-UNEXPECTED-FAIL | automationutils.processLeakLog() | leaked 1 instance of GenericFactory with size 16 bytes
(etc. etc., since it's leaking all of two domwindows)


https://tbpl.mozilla.org/leak-analysis/?id=6825370&tree=Mozilla-Inbound has the usual false-positive about "chrome://mochitests/content/chrome/dom/ipc/tests/test_process_error.xul leaked 1 DOMWINDOW(s)" and then the more likely "chrome://mochitests/content/a11y/accessible/events/test_focus_browserui.xul leaked 2 DOMWINDOW(s)".
OK, so I'm getting this leak 100% reliably with some changes that change the timing of paints.  See https://tbpl.mozilla.org/?tree=Try&rev=8996de59de28

This is blocking landing some Gecko performance improvements at the moment...
BZ what's the latest-best way to debug this?

Alexander, any ideas? Is this always 2 DOM windows at shutdown? I've not seen this locally in BloatView on Windows.
Severity: normal → major
Priority: -- → P1
BZ are you sure the leak in comment 41 was this bug?
Oh /me discovers analysis link and sees:
chrome://mochitests/content/chrome/dom/ipc/tests/test_process_error.xul leaked 1 DOMWINDOW(s)
chrome://mochitests/content/a11y/accessible/events/test_focus_browserui.xul leaked 2 DOMWINDOW(s)
> BZ what's the latest-best way to debug this?

For someone who can reproduce?  Get a refcount balance tree and see what it looks like.

So far I can't reproduce on Mac (where the test just doesn't complete at all) or on Linux... Trying Windows next.  Note that this also doesn't seem to happen on WinXP, at least for my pushes and the sample I tried of links above.
(In reply to Boris Zbarsky (:bz) from comment #45)

> So far I can't reproduce on Mac (where the test just doesn't complete at
> all) 

Mac builds don't have accessibility by default.
> Mac builds don't have accessibility by default.

Yes, but my local builds do of course.  Otherwise running the test would be pretty pointless, now wouldn't it?
> Yes, but my local builds do of course.  Otherwise running the test would be
> pretty pointless, now wouldn't it?

Just making sure.
(In reply to David Bolter [:davidb] from comment #42)

> Alexander, any ideas? Is this always 2 DOM windows at shutdown? I've not
> seen this locally in BloatView on Windows.

maybe 3 (chrome, content and iframe test) but that's not big deal I think.

Try server output of Boris's build (https://tbpl.mozilla.org/php/getParsedLog.php?id=7630811&tree=Try&full=1#error0) says that two accessible classes were leaked (nsAccDocManager and nsAccessibilitySerivce). That's the same object and accessibility doesn't control its life cycle (component manager does I think) so it doesn't sound like a11y is guilty. We'd need a help from some leak finding guru.
Doe that service hold any strong references to any windows or DOM objects?
(In reply to Boris Zbarsky (:bz) from comment #50)
> Doe that service hold any strong references to any windows or DOM objects?

Right. It keeps strong reference to accessible objects which keep strong references to DOM nodes. So if accessibility service is not notified that document goes away by any chance then it will keep the reference on corresponding accessible object that keeps a reference to DOM document.

Can you give me a link to your patch making a problem permanent so I can try it locally?
Attached patch Patch to testSplinter Review
> Can you give me a link to your patch making a problem permanent so I can try
> it locally?

Attached.  This applies on top of rev 913ee424d1d2.
It appears a11y is not guilty here even nsAccDocManager doesn't remove itself from event listeners of DOM window event target (nsPIDOMWindow::GetChromeEventHandler). That makes nsAccDocManager and nsAccessibilityService appear in leaked classes list. Usually when that event target object is destroyed then it clears registered event listeners. I tried to unregister nsAccDocManager from event listeners and nsAccDocManager and nsAccessibilityService disappeared from leaked classes. That should mean somebody else keeps DOM window object alive.
OK, so I can reproduce the leak locally on Windows.  I also see these interesting assertions about "adopting child" from this test.  Those happen even when there is no leak, but are they expected?
OK.  I can confirm comment 56: We're adding 7 refs to the acc doc manager via event listener managers, and only releasing 5 of those.

I assume the acc doc manager doesn't hold any strong refs to the document?
(In reply to Boris Zbarsky (:bz) from comment #60)
> OK, so I can reproduce the leak locally on Windows.  I also see these
> interesting assertions about "adopting child" from this test.  Those happen
> even when there is no leak, but are they expected?

No, they aren't. Iirc they were related with toolbar buttons children creation but I didn't debug that deeply yet.

(In reply to Boris Zbarsky (:bz) from comment #61)
> OK.  I can confirm comment 56: We're adding 7 refs to the acc doc manager
> via event listener managers, and only releasing 5 of those.
> 
> I assume the acc doc manager doesn't hold any strong refs to the document?

It could (nsAccDocManager.mDocAccessibleCache.Get().mDocument) but not in this case since mDocAccessibleCache is empty. Btw, do we need to take the patch that removes accDocManager from listeners?
Alright, on my local setup this test leaks quite reliably with a vanilla tree built from rev 913ee424d1d2
> Btw, do we need to take the patch that removes accDocManager from listeners?

I don't think so, as long as we cycle-collect it.
OK.  This sucks.

I tried doing leak logging on the content document we end up leaking (the one that I'm pretty sure is the root of the leaks), and that consistently changes the timing enough that the leak goes away.  :(
So.  I've managed to get the leak to go away for me both locally and on tinderbox for the moment by changing accessible/tests/mochitest/events.js so that the setTimeout call in eventQueue_processNextInvokerInTimeout uses a delay of 300ms instead of 100ms.

Why that helps, I have no idea.

Is that a change we can check in in the meantime while you continue to debug this?
(In reply to Boris Zbarsky (:bz) from comment #67)
> So.  I've managed to get the leak to go away for me both locally and on
> tinderbox for the moment by changing accessible/tests/mochitest/events.js so
> that the setTimeout call in eventQueue_processNextInvokerInTimeout uses a
> delay of 300ms instead of 100ms.
>
> Why that helps, I have no idea.
> 
> Is that a change we can check in in the meantime 

that makes a11y tests slower, nothing else so no real objections

> while you continue to debug
> this?

Who could be that you? If it doesn't look like a failure in a11y code then I don't feel myself a right person to be assignee. I bet there are people at Mozilla quite experienced on leak finding.
> that makes a11y tests slower,

How was the "100" value selected?  Should I just experiment with various other values here?

> Who could be that you?

Well, for starters someone who understands what eventQueue_processNextInvokerInTimeout is doing.

> If it doesn't look like a failure in a11y code

That's not clear to me at all, for what it's worth.

> I bet there are people at Mozilla quite experienced on leak finding.

Sure.  But any attempt to use the usual tools here fails because they end up affecting the timing and the bug disappears.  See comment 66.

So what's needed here next, I suspect, is to take this a11y testcase and reduce it to a bare minimum that still reproduces the leak, so that we can try to get an idea of what might really be leaking.  And that's most easily done by someone who understands the testcase.
(In reply to Boris Zbarsky (:bz) from comment #69)
> > that makes a11y tests slower,
> 
> How was the "100" value selected?  Should I just experiment with various
> other values here?

It's just some value to wait for unexpected events. You can tweak it.

> Sure.  But any attempt to use the usual tools here fails because they end up
> affecting the timing and the bug disappears.  See comment 66.
> 
> So what's needed here next, I suspect, is to take this a11y testcase and
> reduce it to a bare minimum that still reproduces the leak, so that we can
> try to get an idea of what might really be leaking.  And that's most easily
> done by someone who understands the testcase.

Ok, I'll try to figure out minimal testcase.
200ms also works, by the way.  150ms does not.  Should I just make this change?
(In reply to Boris Zbarsky (:bz) from comment #71)
> 200ms also works, by the way.  150ms does not.  Should I just make this
> change?

that makes sense
Whiteboard: [orange]
Not permaorange, that would make it far too easy to spot the regression, but I retriggered a few on https://tbpl.mozilla.org/?tree=Mozilla-Inbound&rev=3bf0fc40df43 because I'm suspicious like that.

https://tbpl.mozilla.org/php/getParsedLog.php?id=18951243&tree=Mozilla-Inbound
Keywords: mlk
Summary: Intermittent leak of two domwindows opened by test_focus_browserui.xul in mochitest-a11y → Intermittent leak of two domwindows opened by test_focus_browserui.xul in mochitest-a11y - 1549783 bytes (1 AsyncStatement, 1689 AtomImpl, 1 BackstagePass, 1 CalculateFrecencyFunction, 9 CallbackFunction)
I filed bug 846204 to make DocManager to unregister itself from DOM event listeners (updated patch ' unregister nsAccDocManager from event listeners ')
(In reply to alexander :surkov from comment #199)
> I filed bug 846204 to make DocManager to unregister itself from DOM event
> listeners (updated patch ' unregister nsAccDocManager from event listeners ')

Thank you :-)
Depends on: 846204
No, it just feels like it.
Please can you find an owner for this - it's ~ top 5 Orange at the moment.
Flags: needinfo?(dbolter)
(In reply to Ed Morley [:edmorley UTC+0] from comment #231)
> Please can you find an owner for this - it's ~ top 5 Orange at the moment.

I didn't get a minimal leaking test case (comment #69) and it doesn't seem a11y related (a11y doesn't appear in leak logs after bug 846204 is fixed) so my recent work to fix thin places where a11y can leak isn't related to this bug. Not sure how else I can help.
(In reply to alexander :surkov from comment #233)
> (In reply to Ed Morley [:edmorley UTC+0] from comment #231)
> > Please can you find an owner for this - it's ~ top 5 Orange at the moment.
> 
> I didn't get a minimal leaking test case (comment #69) and it doesn't seem
> a11y related (a11y doesn't appear in leak logs after bug 846204 is fixed) so
> my recent work to fix thin places where a11y can leak isn't related to this
> bug. Not sure how else I can help.

It only occurs during mochitest-a11y; so I don't see how this isn't mochitest-a11y related?
(In reply to Ed Morley [:edmorley UTC+0] from comment #234)

> It only occurs during mochitest-a11y; so I don't see how this isn't
> mochitest-a11y related?

I said it's likely not a11y related, I didn't say it's not mochitest-a11y related.
Sounds like bug 846204 will fix this one so putting Alexander as owner for now.
[cc+ mccr8 as our resident leak guru in case he notices something]
Flags: needinfo?(dbolter)
Assignee: nobody → surkov.alexander
(In reply to David Bolter [:davidb] from comment #237)
> Sounds like bug 846204 will fix this one so putting Alexander as owner for
> now.

it won't fix this one (numbers will be changed).
(In reply to alexander :surkov from comment #236)
> (In reply to Ed Morley [:edmorley UTC+0] from comment #234)
> 
> > It only occurs during mochitest-a11y; so I don't see how this isn't
> > mochitest-a11y related?
> 
> I said it's likely not a11y related, I didn't say it's not mochitest-a11y
> related.

Yeah I realise, but I still don't see how this isn't a11y related then.
https://tbpl.mozilla.org/php/getParsedLog.php?id=20578977&tree=Cedar (can't even hope to escape it by waiting until we only run tests on Win8 and higher, since that's Win8)
Does this happen about 1/4 of the time on Win7 Moth?  I ask because I did a try push that just added some printfs, and I got this a bunch of times, so I'm wondering if I was just unlucky or if somehow the printfs caused more of this...
https://tbpl.mozilla.org/?tree=Try&rev=07c5f12d7ebd
(In reply to Andrew McCreight [:mccr8] from comment #311)
> Does this happen about 1/4 of the time on Win7 Moth?  I ask because I did a
> try push that just added some printfs, and I got this a bunch of times, so
> I'm wondering if I was just unlucky or if somehow the printfs caused more of
> this...

I suspect this is very timing dependent, iirc bz managed to make it more or less perma orange several months back with some random layout patch, so it may well be your printfs...
Not permaorange, but when I looked the other day, it was about 60% of the Windows 7 and 8 debug runs leaking.

Probably be a good idea to prepare a patch to disable test_focus_browserui.xul only on debug Win7 and Win8, because if a sheriff winds up rage-disabling it, it'll more likely be on opt+debug and either all Windows, or all platforms.
I pushed remote:   https://hg.mozilla.org/integration/mozilla-inbound/rev/9bfca2900825 to see if it fixes the leak or if it just starts happening some place else
Backed out in https://hg.mozilla.org/integration/mozilla-inbound/rev/78e874a8d473 - clearly the reviewer was an idiot. I won't be surprised if it breaks on Windows, since that "return; to nowhere" probably isn't going to stop the test, but I don't have any idea why it's broken on non-Windows, claiming that no tests were run.
Whiteboard: [leave open][test disabled on Win7/Win8]
Backed out in https://hg.mozilla.org/integration/mozilla-inbound/rev/9c7dea4aed4b - that was closer, but because it doesn't call SimpleTest.finish() or close the window on Windows, it just sits around until it times out.

One of the reasons rage-disables tend to just disable everywhere is because it's so much easier and more reliable :)
(In preparation for bug 850681)
Summary: Intermittent leak of two domwindows opened by test_focus_browserui.xul in mochitest-a11y - 1549783 bytes (1 AsyncStatement, 1689 AtomImpl, 1 BackstagePass, 1 CalculateFrecencyFunction, 9 CallbackFunction) → Intermittent leak of two domwindows opened by test_focus_browserui.xul in mochitest-a11y - 1549783 bytes (AsyncStatement, AtomImpl, BackstagePass, CalculateFrecencyFunction, CallbackFunction)
Summary: Intermittent leak of two domwindows opened by test_focus_browserui.xul in mochitest-a11y - 1549783 bytes (AsyncStatement, AtomImpl, BackstagePass, CalculateFrecencyFunction, CallbackFunction) → Intermittent leak of two domwindows opened by test_focus_browserui.xul in mochitest-a11y - 1549783 bytes (AsyncStatement, AtomImpl, BackstagePass, CalculateFrecencyFunction, CallbackObject, ...)
Was this permaorange on aurora before the last merge, or did it just become permaorange on beta?
All of the instances of this in the last month have been on Mozilla-Release or Beta, which is odd.
The test was disabled on trunk and has been riding the trains, so not too surprising :-)
Ah, yeah. :)
Blocks: 933741
Last seen in 2013.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → WORKSFORME
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: