Closed Bug 1145724 Opened 9 years ago Closed 9 years ago

Crash in mozilla::a11y::TreeWalker::TreeWalker

Categories

(Core :: Disability Access APIs, defect)

ARM
Gonk (Firefox OS)
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla39
blocking-b2g 2.2+
Tracking Status
firefox37 --- wontfix
firefox38 --- wontfix
firefox39 --- fixed
b2g-v2.2 --- fixed
b2g-master --- fixed

People

(Reporter: ntroast, Assigned: yzen)

References

Details

(Keywords: crash, Whiteboard: [b2g-crash][caf priority: p2][CR 811496][caf-crash 584])

Attachments

(7 files, 1 obsolete file)

The following crash signature was observed. cafbot will upload the decoded minidump and extra file

[@ mozilla::a11y::TreeWalker::TreeWalker | nsAccessibilityService::ContentRemoved | mozilla::ElementRestyler::SendAccessibilityNotifications | mozilla::ElementRestyler::RestyleChildren ]

Related Crash: https://crash-stats.mozilla.com/report/index/dd9a84b3-488d-4957-b33c-5f6c22150315
Hi David,

Please assign someone on your team to help to root-cause this crash or route this to the appropriate team.

Thanks!
Mike
Component: General → Disability Access APIs
Flags: needinfo?(dbolter)
Product: Firefox OS → Core
Additional info from CAF (Code Aurora Foundation) regarding urgency:


On Mar 20, 2015, at 2:53 PM, Kumar, Indrajeet <ikumar@codeaurora.org> wrote:

Hi Mike,
   Can you please find someone to take a look at https://bugzilla.mozilla.org/show_bug.cgi?id=1145724
As per the description Mozilla has also seen it earlier and the fix could be as simple as a null check.

This crash has happened 20 times already in our stability testing in just one run so we need to fix it asap.
Current MTBF is 4 and needs to reach 30+ in a week!

Thanks,
-Inder
(In reply to Nicholas Troast [:ntroast] from comment #0)

> Related Crash:
> https://crash-stats.mozilla.com/report/index/dd9a84b3-488d-4957-b33c-
> 5f6c22150315

Hmmm, this looks like a Firefox 23 crash on Windows NT and I couldn't find any similar reports. There is a PluginTimerCallBack::Notify call in the stack which I'm not familiar with...

Alex any ideas?
Flags: needinfo?(dbolter) → needinfo?(surkov.alexander)
My drive-by guess is we're getting a nullptr for aContext at https://dxr.mozilla.org/mozilla-central/source/accessible/base/TreeWalker.cpp#24

The segfault in attachment 8580822 [details] is quite close to 0 and the constructor's initialization list dereferences aContext->Document() without a NULL check.

nsAccessibilityService::ContentRemoved() isn't checking for NULL before creating a TreeWalker?

TreeWalker will assert a non-NULL aContext at https://dxr.mozilla.org/mozilla-central/source/accessible/base/TreeWalker.cpp#27, even though it'll never get a chance to make that assertion due to the potential nullptr dereference on line 24, which we have managed to hit?
Whiteboard: [caf-crash 584]
(In reply to Michael Vines [:m1] [:evilmachines] (PTO 3/21-4/4) from comment #6)
> My drive-by guess is we're getting a nullptr for aContext at
> https://dxr.mozilla.org/mozilla-central/source/accessible/base/TreeWalker.
> cpp#24
> 
> The segfault in attachment 8580822 [details] is quite close to 0 and the
> constructor's initialization list dereferences aContext->Document() without
> a NULL check.
> 
> nsAccessibilityService::ContentRemoved() isn't checking for NULL before
> creating a TreeWalker?

I believe this is correct, though I'm not sure if it should be possible for doc->GetAccessibleOrContainer() to return nullptr in this case.

I wish someone would debug why GetAccessibleOrContainer returns null, but I doubt I'm going to get to it so if you want to take a patch for release branches adding a null check I can live with that.
(In reply to Trevor Saunders (:tbsaunde) from comment #7)
> 
> I wish someone would debug why GetAccessibleOrContainer returns null, but I
> doubt I'm going to get to it so if you want to take a patch for release
> branches adding a null check I can live with that.

We've only been able to reproduce this in monkey testing I believe, so no better STR than run monkey for hours until it crashes.  However we can easily bring in debug patches that add more logging to gather additional data around the event.  Integrating bug 1043112 into our builds could really help here too, but we're not there yet.  So logcat tracing it is for now :(

Yes, a patch with moar null checking would be a big help for the b2g v2.2 release!  Then we can resume the hunt for other crashes that this one may be masking.
Whiteboard: [caf-crash 584] → [CR 811496][caf-crash 584]
Whiteboard: [CR 811496][caf-crash 584] → [caf priority: p2][CR 811496][caf-crash 584]
Whiteboard: [caf priority: p2][CR 811496][caf-crash 584] → [b2g-crash][caf priority: p2][CR 811496][caf-crash 584]
Keywords: crash
(In reply to Michael Vines [:m1] [:evilmachines] (PTO 3/21-4/4) from comment #8)
> (In reply to Trevor Saunders (:tbsaunde) from comment #7)
> > 
> > I wish someone would debug why GetAccessibleOrContainer returns null, but I
> > doubt I'm going to get to it so if you want to take a patch for release
> > branches adding a null check I can live with that.
> 
> We've only been able to reproduce this in monkey testing I believe, so no
> better STR than run monkey for hours until it crashes.  However we can
> easily bring in debug patches that add more logging to gather additional
> data around the event. 

can you enable "tree" logging pls? see http://asurkov.blogspot.ca/2012/10/debugging-firefox-accessibility.html
Flags: needinfo?(surkov.alexander)
> can you enable "tree" logging pls? see
> http://asurkov.blogspot.ca/2012/10/debugging-firefox-accessibility.html

Hi Alex, how much extra logging will this generate?  I want to know if it's safe to turn this on in more than just an engineering build.
a lot of logging, every content insertion/deletion/some style changes will generate into it, so it'd be good to restrict that output as much as possible
blocking-b2g: 2.2? → 2.2+
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #15)
> Why is a11y enabled at all?

I'm not sure what "monkey testing" involves, but maybe the screen reader got enabled?  but I'm not really sure and it is a good question.
I'm reverting this back to ARM/gonk and I think the stack in comment 0 is unrelated.
OS: Linux → Gonk (Firefox OS)
Hardware: x86_64 → ARM
Note I meant unrelated as in not the actual crash instances we're talking about here (since that crash was on windows). The fact that contentremoval is in (all?) the stack(s) may indeed be related.
Nicholas,

Please respond, or have someone at CAF who's able to, respond to Kyle's comment #15:

> Why is a11y enabled at all?


and Trevor's comment #16:

> I'm not sure what "monkey testing" involves, but maybe the screen reader got
> enabled?  but I'm not really sure and it is a good question.


Our Disability Access APIs team (:dbolter) believes a11y should only be enabled by users' specific action via the settings view; cc'ing :eeejay + :yzen for confirmation.

Thanks,
Mike
Flags: needinfo?(ntroast)
If a11y can be turned on through settings, then I think it can turned on by monkey, too.  Monkey is just pressing different areas of the screen at random.

However, I think this particular crash was found by stability test team (through scripting or manual test).  Whether they turned a11y on or not, I'm not sure.  I'll try to find out more.  Hopefully you can carry on with the analysis while I find out, thanks.
Flags: needinfo?(ntroast)
Another interesting question is whether this code works in the presence of web components.
add ni for myself for tracking.
Flags: needinfo?(sku)
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #21)
> Another interesting question is whether this code works in the presence of
> web components.

blaim says it comes from bug 1040735 so it should work.  Though clearly there exists some case in which it doesn't.
The tester that found the issue told me the following:

The accessibility option in settings shows only one menu “color Filter” which is disabled. Still we are able to reproduce the crash.
Below are the steps to reproduce this-
1. Dial a number in dial pad.
2. Long press the green icon used to make a call.
3. You will find phone just crashed pop-up.
we can not repo. this issue on both ZTE open L and Nexus5-L per comment 24.
Hi David:
 Could you please also check this issue to see any light to shed?
Flags: needinfo?(dbaron)
What are the hg changesets (or git revisions) for the source corresponding to the minidumps in comment 2 and comment 27?  It's hard to analyze them with confidence without knowing that.

(Also, I'm in meetings all day today.)
Flags: needinfo?(dbaron)
Er, I guess comment 25 answers that question for comment 27's minidump (which I didn't look at as closely as comment 2's).
(In reply to Greg Grisco from comment #24)
> The tester that found the issue told me the following:
> 
> The accessibility option in settings shows only one menu “color Filter”
> which is disabled. 

AFAIK, if the menu for screen reader is not visible, it has never been activated by the user. The screen reader menu item would remain visible forever if it was enabled at least once, or enabled via the Developer Settings.

> Still we are able to reproduce the crash.
> Below are the steps to reproduce this-
> 1. Dial a number in dial pad.
> 2. Long press the green icon used to make a call.
> 3. You will find phone just crashed pop-up.
(In reply to Yura Zenevich [:yzen] from comment #32)
> (In reply to Greg Grisco from comment #24)
> 
> AFAIK, if the menu for screen reader is not visible, it has never been
> activated by the user. The screen reader menu item would remain visible
> forever if it was enabled at least once, or enabled via the Developer
> Settings.

So you're saying that it seems inconsistent that with screen reader never being enabled that we're landing in a11y code?  Is there any explanation for this?
This one crash has been seen so many times that it could block our FC, we need to have some solution for this very soon.
Flags: needinfo?(mlee)
Yura, please respond to Greg's comment #33:

> So you're saying that it seems inconsistent that with screen reader never
> being enabled that we're landing in a11y code?  Is there any explanation for
> this?

As Greg mentions in comment 34, this is a critical crash that we need to resolve and understanding what's happening with a11y seems to be key to achieving that.
 
Thanks,
Mike
Flags: needinfo?(mlee) → needinfo?(yzenevich)
Revisiting comment 6, 7 and 8, should null checking be done?
(In reply to Greg Grisco from comment #33)
> (In reply to Yura Zenevich [:yzen] from comment #32)
> > (In reply to Greg Grisco from comment #24)
> > 
> > AFAIK, if the menu for screen reader is not visible, it has never been
> > activated by the user. The screen reader menu item would remain visible
> > forever if it was enabled at least once, or enabled via the Developer
> > Settings.
> 
> So you're saying that it seems inconsistent that with screen reader never
> being enabled that we're landing in a11y code?  Is there any explanation for
> this?

So far the only explanation I have is based on comment 20:

> However, I think this particular crash was found by stability test team
> (through scripting or manual test).  Whether they turned a11y on or not, I'm
> not sure.  I'll try to find out more.  Hopefully you can carry on with the
> analysis while I find out, thanks.

If at any point an integration test was run, accessibility will be turned on.
Flags: needinfo?(yzenevich)
Attached patch 1145724 null check patch (obsolete) — Splinter Review
Adding a null check patch in the meanwhile.
Attachment #8583346 - Flags: review?(surkov.alexander)
bug 1147646 is another a11y crash that was recently found by stress test.  Although, that crash has only been seen once so far.  Referencing it here in case it helps the analysis.

I'm going to apply the null-check patch, thanks.
Yura, could you help on this bug?
Assignee: nobody → yzenevich
Comment on attachment 8583346 [details] [diff] [review]
1145724 null check patch

Review of attachment 8583346 [details] [diff] [review]:
-----------------------------------------------------------------

r=me with comment fixed

::: accessible/base/nsAccessibilityService.cpp
@@ +535,5 @@
>      if (!child) {
> +      Accessible* container = document->GetContainerAccessible(aChildNode);
> +      if (container) {
> +        a11y::TreeWalker walker(container, aChildNode,
> +                              a11y::TreeWalker::eWalkCache);

container is not really used when you don't create the tree so you can use 'document' if container is null
nit: please make sure you lined up arguments
Attachment #8583346 - Flags: review?(surkov.alexander) → review+
Addressed Alex's comments, carrying over r+
Attachment #8583813 - Flags: review+
Attachment #8583346 - Attachment is obsolete: true
Whiteboard: [b2g-crash][caf priority: p2][CR 811496][caf-crash 584] → [b2g-crash][caf priority: p1][CR 811496][caf-crash 584]
Ill land the null check patch once the tree is open.
Whiteboard: [b2g-crash][caf priority: p1][CR 811496][caf-crash 584] → [b2g-crash][caf priority: p2][CR 811496][caf-crash 584]
Flags: needinfo?(sku)
https://hg.mozilla.org/mozilla-central/rev/3f5c7e485c60
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla39
Attachment #8583813 - Flags: approval-mozilla-b2g37?
Flags: needinfo?(bbajaj)
Comment on attachment 8582242 [details]
decoded minidump - AU_LINUX_GECKO_LF.BR.1.2.3.00.00.00.000.110

[Triage Comment]

Approving the low risk null check
Flags: needinfo?(bbajaj)
Attachment #8582242 - Flags: approval-mozilla-b2g37+
Attachment #8583813 - Flags: approval-mozilla-b2g37? → approval-mozilla-b2g37+
Attachment #8582242 - Flags: approval-mozilla-b2g37+
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: