Closed Bug 1170049 Opened 9 years ago Closed 7 years ago

crash in mozilla::a11y::DocAccessibleParent::Destroy()

Categories

(Core :: Disability Access APIs, defect)

39 Branch
Unspecified
Windows NT
defect
Not set
critical

Tracking

()

RESOLVED FIXED
Tracking Status
e10s - ---
firefox44 + wontfix
firefox45 - unaffected
firefox46 - unaffected
firefox-esr38 --- unaffected
firefox-esr45 --- unaffected
firefox-esr52 --- unaffected

People

(Reporter: MatsPalmgren_bugz, Assigned: tbsaunde)

References

Details

(5 keywords, Whiteboard: [aes+][not just e10s -- top crash in release w/out e10s])

Crash Data

Attachments

(1 file)

This bug was filed from the Socorro interface and is 
report bp-2225e065-2b94-4891-bd26-817932150518.
=============================================================


Crashes in the past 28 days (all products and versions):
Product 	Version 	Percentage 	Number Of Crashes
Firefox 	41.0a1 	61.96 %	101
Firefox 	40.0a2 	17.18 %	28
Firefox 	39.0a2 	12.27 %	20
Firefox 	40.0a1 	4.91 %	8
Firefox 	39.0b1 	3.68 %	6


Stack:

mozilla::a11y::DocAccessibleParent::Destroy()
mozilla::a11y::DocAccessibleParent::Destroy()
mozilla::a11y::DocAccessibleParent::ActorDestroy(mozilla::ipc::IProtocolManager<mozilla::ipc::IProtocol>::ActorDestroyReason)
mozilla::a11y::PDocAccessibleParent::OnMessageReceived(IPC::Message const&)
mozilla::dom::PContentParent::OnMessageReceived(IPC::Message const&)
mozilla::ipc::MessageChannel::OnMaybeDequeueOne()
RunnableMethod<mozilla::ipc::MessageChannel, bool ( mozilla::ipc::MessageChannel::*)(void), Tuple0>::Run()
MessageLoop::DoWork()
mozilla::ipc::DoWorkRunnable::Run()
nsThread::ProcessNextEvent(bool, bool*)
putting back into e10s triage, this has made it's way into top 10 in crash volume on Nightly
Keywords: topcrash-win
Blocks: e10sa11y2
Hey Trevor, looking at the code in DocAccessibleParent::Destroy, I'm pretty confused. First, we assert that mChildDocs is empty, then we attempt to call destroy on all of the child docs anyway? Even with that, however, the lop itself is poorly constructed:

  for (uint32_t i = childDocCount - 1; i < childDocCount; i--)

It looks like it's half written to iterate backwards and half forwards, but either way is not correct if the array is empty, as the function asserts. Would you mind taking a look and figuring out what's intended?
Flags: needinfo?(tbsaunde+mozbugs)
(In reply to Blake Kaplan (:mrbkap) (please use needinfo!) from comment #2)
> Hey Trevor, looking at the code in DocAccessibleParent::Destroy, I'm pretty
> confused. First, we assert that mChildDocs is empty, then we attempt to call
> destroy on all of the child docs anyway? Even with that, however, the lop

Well, we need to destroy the child docs because I'm pretty sure we've seen crashes in the wild where that didn't happen.  However it seems kind of bad that's happening and I'd like to fix it if we find a test case so the assert is there in the hope a fuzzer will hit it, or someone running a debug build might find a test case to look at.

> itself is poorly constructed:
> 
>   for (uint32_t i = childDocCount - 1; i < childDocCount; i--)
> 
> It looks like it's half written to iterate backwards and half forwards, but
> either way is not correct if the array is empty, as the function asserts.
> Would you mind taking a look and figuring out what's intended?

actually I believe that's correct, but its kind of confusing since you need to remember unsigned arithmetic is mod 2^32 so 0 - 1 is 2^31 - 1 not -1 (well they are the same in z_2^32 but ;))
Flags: needinfo?(tbsaunde+mozbugs)
Assignee: nobody → tbsaunde+mozbugs
Keywords: leave-open
Attachment #8643710 - Flags: review?(lorien) → review+
Checked about 20 stacks, couldn't find any with "ActorDestroy".

Trevor, maybe we should turn CheckDocTree into a boolean and return false instead of crashing. Then at the callsite crash if it returns false? This would make it a lot easier to see where we are crashing in socorro.
Flags: needinfo?(tbsaunde+mozbugs)
(In reply to David Bolter [:davidb] from comment #8)
> Checked about 20 stacks, couldn't find any with "ActorDestroy".
> 
> Trevor, maybe we should turn CheckDocTree into a boolean and return false
> instead of crashing. Then at the callsite crash if it returns false? This
> would make it a lot easier to see where we are crashing in socorro.

that's fine with me, but I'm kind of in the middle of fixing 1189277 for another day or two.
Flags: needinfo?(tbsaunde+mozbugs)
I noticed that this crash is still occurring a lot on aurora 42. Has the investigation here stalled?
Flags: needinfo?(tbsaunde+mozbugs)
Flags: needinfo?(dbolter)
(In reply to David Major [:dmajor] from comment #10)
> I noticed that this crash is still occurring a lot on aurora 42. Has the
> investigation here stalled?

well, I'm trying to add more checking but that runs into marionette-e10s crashes (hopefully the root cause of this?).  However the marionette  tests are a giant pain to deal with so I'm still in the middle of that :(

of course being able to reproduce this would make it so much easier :/
Flags: needinfo?(tbsaunde+mozbugs)
I don't have more to add but thanks for the ping. Let's see if we can get somewhere in the next couple of days.
Flags: needinfo?(dbolter)
Crash Signature: [@ mozilla::a11y::DocAccessibleParent::Destroy()] → [@ mozilla::a11y::DocAccessibleParent::Destroy()] [@ mozilla::a11y::DocAccessibleParent::Destroy]
I do see the stack from comment 13 occasionally (e.g. bp-e1842f86-9785-4f93-88b3-4ce5e2160110), but most of them now look like bp-31b96390-29a1-44e6-add8-dbd4f2160111 and are attempting to read the address  0xffffffffe5e5e5e5. The e5e5 marker is one we use for deleted memory so this is a sign of a potentially vulnerable use-after-free. In older builds the marker was 0x5a5a5a5a as in bp-d94f0945-8022-45b5-8db8-a88dd2160111
Group: layout-core-security
Whiteboard: not just e10s -- top crash in release w/out e10s
Summary: [e10s] crash in mozilla::a11y::DocAccessibleParent::Destroy() → crash in mozilla::a11y::DocAccessibleParent::Destroy()
Group: layout-core-security → dom-core-security
We are getting around 4000 crashes per week for this signature, #29 on the topcrash list. Sounds like this was a regression from 39. 
We should track this since it's sec-high. I think we haven't been tracking because it is a crash we only, or mostly, see on release with each new version. 

Marking wontfix for 44.
Trev is it worth poking Jesse (or similar) to try to recreate via fuzzing?
Flags: needinfo?(tbsaunde+mozbugs)
Group: dom-core-security → core-security-release
So first I thought we're missing to keep some strong member variable alive while setting it to null elsewhere, but then
nsTArray<DocAccessibleParent*> mChildDocs; and DocAccessibleParent* mParentDoc;
So, what is the lifetime management supposed to be here.
Oh... Parent side seems to expect that is always gets 'Shutdown' message. But of course nothing guarantees that. Child process may just die.
(In reply to Olli Pettay [:smaug] (high review load) from comment #18)
> Oh... Parent side seems to expect that is always gets 'Shutdown' message.
> But of course nothing guarantees that. Child process may just die.

no, it expects either Shutdown message is sent or ActorDestroy() is called, and afaik the latter should happen for all actors always.

basically the child owns the DocAccessibleParent.

ways the document tree can change:
PDocAccessible constructor message can add subtrees.
Hide message can remove a subtree from the main one through ProxyAccessible::Shutdown()
BindChildDoc message can add a subtree that is new or was removed by Hide message.
Shutdown message / ActorDestroy can remove subtrees.

I'd really like to know why we've never found this with fuzzers, and we really need to have non 0 test coverage for a11y with e10s.  Even this little bit of caching has been really buggy mostly because of missing events, and that's only going to be much worse with more stuff cached.
Flags: needinfo?(tbsaunde+mozbugs)
(In reply to Trevor Saunders (:tbsaunde) from comment #19)
> (In reply to Olli Pettay [:smaug] (high review load) from comment #18)
> > Oh... Parent side seems to expect that is always gets 'Shutdown' message.
> > But of course nothing guarantees that. Child process may just die.
> 
> no, it expects either Shutdown message is sent or ActorDestroy() is called,
> and afaik the latter should happen for all actors always.
Ah, there. Missed that called for Shutdown.

> basically the child owns the DocAccessibleParent.
but parent link is raw.


> 
> ways the document tree can change:
> PDocAccessible constructor message can add subtrees.
> Hide message can remove a subtree from the main one through
> ProxyAccessible::Shutdown()
> BindChildDoc message can add a subtree that is new or was removed by Hide
> message.
> Shutdown message / ActorDestroy can remove subtrees.
So we're missing some case here.
Otherwise mChildDocs wouldn't have a pointer to apparently dead object.
> > ways the document tree can change:
> > PDocAccessible constructor message can add subtrees.
> > Hide message can remove a subtree from the main one through
> > ProxyAccessible::Shutdown()
> > BindChildDoc message can add a subtree that is new or was removed by Hide
> > message.
> > Shutdown message / ActorDestroy can remove subtrees.
> So we're missing some case here.
> Otherwise mChildDocs wouldn't have a pointer to apparently dead object.

or one of those things behaves improperly in some case.  Either way there is a bug somewhere in how we keep the tree of documents up to date.

we can add more asserts that are on in nightly to check when the tree gets out of date, but given the othertree update bugs I'm not sure how many actually run with e10s on so I'm not it would help much.  Hopefully geting tests working will find a test case, or fixing those other bugs will make it clearer what the problem is.
Looks like it is fixed in 45 & 46 now. No crashes.
Marking fixed per comment 22.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Keywords: leave-openregression
Resolution: FIXED → WORKSFORME
Status: RESOLVED → REOPENED
Resolution: WORKSFORME → ---
Whiteboard: not just e10s -- top crash in release w/out e10s → aes+
Whiteboard: aes+ → [aes+][not just e10s -- top crash in release w/out e10s]
Crash Signature: [@ mozilla::a11y::DocAccessibleParent::Destroy()] [@ mozilla::a11y::DocAccessibleParent::Destroy] → [@ mozilla::a11y::DocAccessibleParent::Destroy()] [@ mozilla::a11y::DocAccessibleParent::Destroy] [@ PLDHashTable::Iterator::Iterator | mozilla::a11y::DocAccessibleParent::Destroy]
Trevor is the diagnostic assert in DocAccessibleParent::Destroy() helping?
Flags: needinfo?(tbsaunde+mozbugs)
In the last month we have 1536 crash reports with 118 hitting the assert 'MOZ_DIAGNOSTIC_ASSERT(mChildDocs[i] != mChildDocs[j])', all in 54a1 within build id range: 20170217030229 to 20170302030206 (Feb 17 to March 2).

e.g. Windows: https://crash-stats.mozilla.com/report/index/92b9fd55-f4eb-4a2f-981b-0d3dc2170303
e.g. Linux: https://crash-stats.mozilla.com/report/index/80c92b13-fd92-4469-8805-244ac2170303
Wondering if this has resolved, and what fixed it after March 2?
Trevor suggests this was fixed via bug 1340579 which lines up. Note: that work is a mitigation and not a root cause fix (which dparks is hunting). I don't think we need to open this bug yet.
Status: REOPENED → RESOLVED
Closed: 8 years ago7 years ago
Flags: needinfo?(tbsaunde+mozbugs)
Resolution: --- → FIXED
Given the whiteboard comment, do we need to consider backporting something to stable branches here?
Flags: needinfo?(dbolter)
Yes. Suspected fixes are bug 1332690 and bug 1340579 - I've requested authors to request uplifts.
Flags: needinfo?(dbolter)
Group: core-security-release
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: