Closed Bug 1318667 Opened 8 years ago Closed 7 years ago

Crash in libdvm.so@0x849e6

Categories

(Firefox for Android Graveyard :: General, defect, P1)

51 Branch
x86
Android
defect

Tracking

(relnote-firefox 51+, fennec51+, firefox51+ verified, firefox52+ verified, firefox53+ verified, firefox54+ verified)

VERIFIED FIXED
Firefox 54

People

(Reporter: marcia, Assigned: sebastian)

References

Details

(Keywords: crash, regression, topcrash, Whiteboard: [MobileAS])

Crash Data

Attachments

(2 files)

This bug was filed from the Socorro interface and is report bp-793d8ae8-e984-4d8d-9baa-d7d0e2161118.
=============================================================

This appeared as a new startup crash in the first beta: http://bit.ly/2fM7oMI

All the devices seem to be TR10CS1 which is a tablet.
Hi Sebastian,
Can you help take a look at this one? It crashed in 51.0b1.
Flags: needinfo?(s.kaspari)
Adding a few signatures, with the added volume this is becoming the top overall crash on 51.0b1.
Crash Signature: [@ libdvm.so@0x849e6] → [@ libdvm.so@0x849e6] [@ libdvm.so@0x84d96] [@ libdvm.so@0x84c86] [@ libdvm.so@0x849d6]
Keywords: topcrash
This seems to be a crash in native code. I'll flag it for triage.
tracking-fennec: --- → ?
Flags: needinfo?(s.kaspari)
This is crash number 2, 3, 4, and 7 of the top 10 crashes.
Crash Signature: [@ libdvm.so@0x849e6] [@ libdvm.so@0x84d96] [@ libdvm.so@0x84c86] [@ libdvm.so@0x849d6] → [@ libdvm.so@0x849e6] [@ libdvm.so@0x84d96] [@ libdvm.so@0x84c86] [@ libdvm.so@0x849d6] [@ libdvm.so@0x84736] [@ libdvm.so@0x3f686]
[Tracking Requested - why for this release]: top crash
Track 51+ as top crash in fennec.
Hi :sebastian,
Can you help find an owner for this?
Flags: needinfo?(s.kaspari)
See Also: → 1320153
No longer blocks: 1320153
This looks like it could be media related. Blake, could you folks take a look?
tracking-fennec: ? → 51+
Flags: needinfo?(s.kaspari) → needinfo?(bwu)
I don't recall libdvm.so being media related. I could be wrong. 
John, 
Could you have a look?
Flags: needinfo?(bwu) → needinfo?(jolin)
IIRC, libdvm is the Dalvik Java VM. Do we have a bug report or logcat dump for these crashes?
Flags: needinfo?(jolin)
Yeah, libdvm is dalvik. I saw a few logcats that indicated there was some media stuff going on, but looking at some others now I don't see that, so maybe it's unrelated. Regardless, I don't really see anything actionable since the trace is so useless.
Mark 51 won't fix as there is nothing actionable now.
One of these signatures ([@ libdvm.so@0x849e6 ]) spiked slightly in the last Fennec beta, and there are now about 1500 crashes (202 installs) in the 51 release build of Fennec. 

The crashing device in this signature is TR10CS1, which appears to be some kind of educational tablet.
Looks as if most of the other signatures map to different types of tablets, all running KitKat (API 19) - Examples:

libdvm.so@0x849d6: Asus
libdvm.so@0x84c86: Dell
Crash Signature: [@ libdvm.so@0x849e6] [@ libdvm.so@0x84d96] [@ libdvm.so@0x84c86] [@ libdvm.so@0x849d6] [@ libdvm.so@0x84736] [@ libdvm.so@0x3f686] → [@ libdvm.so@0x849e6] [@ libdvm.so@0x84d96] [@ libdvm.so@0x84c86] [@ libdvm.so@0x849d6] [@ libdvm.so@0x84736] [@ libdvm.so@0x3f686] [@ libdvm.so@0x85a56]
Keywords: regression
snorp: I know you noted in Comment 12 that it isn't actionable - any other ideas as to what we could do to investigate further? Volume is fairly high in early 51 data. Some of the comments in the first signature do mention videos.

Also is there any significance to the fact that it started in beta and wasn't present before that?
Flags: needinfo?(snorp)
These are all x86 devices. They don't exist in great numbers on Aurora & Nightly. TR10CS1 seems to be an educational tablet computer with a keyboard, distributed to Venezuelan students. The Dell Venue devices might be the easiest to come by in the NA market. It is doubtful that these will be available as new; they were produced in 2014.
Crash Signature: [@ libdvm.so@0x849e6] [@ libdvm.so@0x84d96] [@ libdvm.so@0x84c86] [@ libdvm.so@0x849d6] [@ libdvm.so@0x84736] [@ libdvm.so@0x3f686] [@ libdvm.so@0x85a56] → [@ libdvm.so@0x849e6] [@ libdvm.so@0x84d96] [@ libdvm.so@0x84c86] [@ libdvm.so@0x849d6] [@ libdvm.so@0x84736] [@ libdvm.so@0x3f686] [@ libdvm.so@0x85a56] [@ libdvm.so@0x85c86] [@ libdvm.so@0x84a46]
These signatures account for 35% of all reported crashes on Fennec 51 right now.
I temporarily halted 51 fennec staged rollout (which was set at 10% of all users) this morning while we investigated. It is now re-enabled. 
The specific device most heavily affected is TR10CS1_19, seen here: https://crash-stats.mozilla.com/search/?signature=%5Elibdvm.so%400x8&android_model=%3DTR10CS1&android_model=%3DTR10CS1&version=51.0&date=%3E%3D2016-04-01T00%3A00%3A00.000Z&date=%3C2017-01-26T16%3A56%3A00.000Z&_sort=-date&_facets=signature&_facets=android_brand&_facets=android_model&_facets=android_device&_facets=android_hardware&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-android_device

I added TR10CS1_19 (just one of many TR10CS1 devices on the market) to the list of excluded devices in the Play Store. I am fairly sure this means that until we fix the crashing issue and un-exclude that device, the TR10CS1 users will keep the version they have, but won't be able to update, and won't be able to install Firefox from a new download.
This search should include all the problematic devices. Unfortunately ECS is an OEM so the devices show up from many different brands. The search looks for crashes from Firefox for Android where the signature starts with libdvm.so@0x and the device architecture is x86.  

https://crash-stats.mozilla.com/search/?signature=%5Elibdvm.so%400x&cpu_arch=x86&product=FennecAndroid&_sort=-date&_facets=android_device&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform&_columns=cpu_arch#facet-android_device
I looked at a few of the crashes and they all seem to happen on the "GeckoIconTask" thread. GeckoIconTask landed in bug 1300543 for 51, so that makes sense. Maybe we can spin a dot release where we disable GeckoIconTask just for this device.
Assignee: nobody → s.kaspari
Flags: needinfo?(snorp) → needinfo?(s.kaspari)
BTW I think "GeckoIconTask" is only for loading favicons, so disabling it involves some loss of functionality but it's not significant IMO.
This may show up in beta 52 and if so maybe we can fix it in time for 52 release.
(In reply to Jim Chen [:jchen] [:darchons] from comment #21)
> I looked at a few of the crashes and they all seem to happen on the
> "GeckoIconTask" thread.

Where do you see that it's related to GeckoIconTask? I can't find it in the crash reports. This is plain Java code. If it's crashing libdvm.so then something pretty weird is going on.

(In reply to Jim Chen [:jchen] [:darchons] from comment #22)
> BTW I think "GeckoIconTask" is only for loading favicons, so disabling it
> involves some loss of functionality but it's not significant IMO.

That's correct. That could be an emergency quick fix but isn't really a solution I would like to ship.
Flags: needinfo?(s.kaspari)
(In reply to Sebastian Kaspari (:sebastian) from comment #24)
> (In reply to Jim Chen [:jchen] [:darchons] from comment #21)
> > I looked at a few of the crashes and they all seem to happen on the
> > "GeckoIconTask" thread.
> 
> Where do you see that it's related to GeckoIconTask? I can't find it in the
> crash reports. This is plain Java code. If it's crashing libdvm.so then
> something pretty weird is going on.

I looked at the binary minidump files from several crash reports, and the crashing threads were all GeckoIconTask. 
Seems like some kind of Dalvik bug we're hitting. There are definitely different things we can try (e.g. not using ThreadPoolExecutor), but we have to act quickly.
Crash Signature: [@ libdvm.so@0x849e6] [@ libdvm.so@0x84d96] [@ libdvm.so@0x84c86] [@ libdvm.so@0x849d6] [@ libdvm.so@0x84736] [@ libdvm.so@0x3f686] [@ libdvm.so@0x85a56] [@ libdvm.so@0x85c86] [@ libdvm.so@0x84a46] → [@ libdvm.so@0x849e6] [@ libdvm.so@0x84d96] [@ libdvm.so@0x84c86] [@ libdvm.so@0x849d6] [@ libdvm.so@0x84736] [@ libdvm.so@0x3f686] [@ libdvm.so@0x85a56] [@ libdvm.so@0x85c86] [@ libdvm.so@0x84a46] [@ dalvik-zygote (deleted)@0x14bf]
Still the top crash for 51.0.1 for Fennec. I'm blocking updates (excluding in Play Store) for a few more of the devices most severely affected by the startup crash:  

  Dell Venue 8 - yellowtail
  ZTE V975 - redhookbay
  Acer B1-730HD - vespa
  Acer A1-830 - ducati
  Asus Transformer Pad (TF103CG) - K018
I got one of those devices from eBay and will debug this as soon as it arrives. In the meantime excluding those devices on Google Play might be our best option. Disabling the IconTask for them is a bit cumbersome if we do not know all affected devices. Disabling this code for all users is not an option.
I added these tablets to the excluded devices: 
ASUS Transformer Pad K018, K017, K014, K01A, K00g
Dell Venue 8 - yellowtail
Once we have a fix, we should try taking them off the list so folks can update.
I received the device today and the "good" news is that it's crashing reliably. I'll try to debug this today.
Took quite a while to get this reproducing with my own builds. So far only the release version crashed - not Nightly, Aurora, Beta, or local builds. However, I can now reproduce it using the release branch and the official release branding:

The palette support library is the culprit. We use it to extract a dominant color from icons (we use the color in the UI). Coincidentally, our release Firefox icon triggers this crash in the support library. That's why it's mostly happening in the release version, and that's also why it's happening so often (and on tablets): we load the icon to display it on the tab strip etc. when loading about:home. But there's no reason why other icons shouldn't trigger it too.

51.0 is the first release where we switched to the palette library because it's significantly faster than our custom implementation (that we still have in the code base for other reasons). A quick fix is to switch back to our own implementation on x86 devices (and keep the faster library on other devices). One more reason to update the support library soon (bug 1333704).

I'll prepare a patch - it should be small and the risk should be low.
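To illustrate the shape of the fallback (a hypothetical sketch, not the actual Fennec patch): on x86 we would skip the palette library and compute a dominant color directly from pixel data. The bucket-counting method below is a stand-in for what BitmapUtils.getDominantColor() conceptually does; the ABI string is passed in explicitly so the sketch stays testable off-device (on Android it would come from Build.CPU_ABI).

```java
import java.util.HashMap;
import java.util.Map;

public class DominantColorSketch {
    // Hypothetical stand-in for BitmapUtils.getDominantColor():
    // quantize each ARGB pixel into a coarse bucket and return the
    // most frequent bucket's color.
    static int getDominantColor(int[] pixels) {
        Map<Integer, Integer> counts = new HashMap<>();
        int best = 0xFF000000;
        int bestCount = 0;
        for (int p : pixels) {
            // Drop the low 4 bits of each channel to group similar colors.
            int bucket = p & 0xFFF0F0F0;
            int c = counts.merge(bucket, 1, Integer::sum);
            if (c > bestCount) {
                bestCount = c;
                best = bucket;
            }
        }
        return best;
    }

    // The device check would read Build.CPU_ABI on Android; the ABI is a
    // parameter here so the logic is testable on a plain JVM.
    static boolean shouldUsePaletteLibrary(String abi) {
        return !abi.startsWith("x86");
    }

    public static void main(String[] args) {
        int[] pixels = { 0xFFFF0000, 0xFFFF0000, 0xFF00FF00 };
        System.out.println(Integer.toHexString(getDominantColor(pixels)));
        System.out.println(shouldUsePaletteLibrary("x86"));         // false
        System.out.println(shouldUsePaletteLibrary("armeabi-v7a")); // true
    }
}
```

The quantization step (masking the low bits of each channel) is what keeps near-identical pixels in one bucket; the real BitmapUtils implementation may count colors differently.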
I need to create a separate patch for the release branch (and maybe others).
Comment on attachment 8833355 [details]
Bug 1318667 - Do not use palette library on x86 devices (Use BitmapUtils.getDominantColor()).

https://reviewboard.mozilla.org/r/109602/#review110668

Eughh. (I wonder if newer versions fix this?)
Attachment #8833355 - Flags: review?(ahunt) → review+
This is the patch for the release branch.
Comment on attachment 8833355 [details]
Bug 1318667 - Do not use palette library on x86 devices (Use BitmapUtils.getDominantColor()).

This one applies to Aurora too (I'll add the details for the request later after adding all the patches).
Attachment #8833355 - Flags: approval-mozilla-aurora?
Comment on attachment 8833360 [details] [diff] [review]
1318667-Release.patch

This one applies to Beta and Release.
Attachment #8833360 - Flags: approval-mozilla-release?
Attachment #8833360 - Flags: approval-mozilla-beta?
Pushed by s.kaspari@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/26b85661155e
Do not use palette library on x86 devices (Use BitmapUtils.getDominantColor()). r=ahunt
(Request for Aurora, Beta and Release uplift)

Approval Request Comment

[Feature/Bug causing the regression]: In Firefox 51.0 we refactored the icon code and decided to switch to the palette library for color extraction (Faster than our own implementation). The switch was done in bug 1300569.

[User impact if declined]: We see a bunch of crashes on x86 devices. This doesn't happen for all icons but at least for our Firefox release icon, which gets loaded quite often. So for those users it's basically a crash loop.

[Is this code covered by automated tests?]: Yes, the new icon code is covered. But this crash is coming from inside Android's support library.

[Has the fix been verified in Nightly?]: Nightly is not directly affected - or at least we do not know which other website icon might trigger this. So far I manually verified the patch with a custom release build on a TF103CG.

[Needs manual test from QE? If yes, steps to reproduce]: Not necessarily. But the steps are: Get one of the affected devices. Install the release version of 51.0. Load a website, open a new tab.

[List of other uplifts needed for the feature/fix]: -

[Is the change risky?]: The patch itself is not risky.

[Why is the change risky/not risky?]: On x86 devices (that's the "smallest" group of impacted devices I can identify) we no longer use the palette library with this patch. Instead we fall back to our custom color extraction code. This code has been in place in previous releases and has no known crashes.

[String changes made/needed]: None
For the current release, might it not be better, and more in line with normal procedures, to just revert the patch from bug 1300569 and take this fix for version 52 and forward?
You'd need to revert the patch from bug 1300569 and (or only) patch 6 from bug 1300543 (this one actually replaces the code in the icon pipeline). And the code was modified since then so we'd need a custom patch again anyways. Not sure if we gain much by that.
(In reply to Sebastian Kaspari (:sebastian) from comment #40)
> You'd need to revert the patch from bug 1300569 and (or only) patch 6 from
> bug 1300543 (this one actually replaces the code in the icon pipeline). And
> the code was modified since then so we'd need a custom patch again anyways.
> Not sure if we gain much by that.

OK, if this is simpler than the backout. I was just saying that our policy is to back out, it is way past the release date for this, and hardly anyone is running version 51 on Android devices from what I can see. If this code avoids the issue and is easier than the backout, I am all for it.
Okay, I just verified. Just backing this one out is an option for release too:
https://hg.mozilla.org/mozilla-central/rev/4e9bf0dca65a

This works without conflicts and just reverts the change in the pipeline (we still include the library in the build, we just don't use it) -> No crash.
https://hg.mozilla.org/mozilla-central/rev/26b85661155e
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → Firefox 54
Comment on attachment 8833360 [details] [diff] [review]
1318667-Release.patch

OK, let's take it on all branches. I also prefer the patch over the backout.

By the way, maybe we should add a regression test here?
Attachment #8833360 - Flags: approval-mozilla-release?
Attachment #8833360 - Flags: approval-mozilla-release+
Attachment #8833360 - Flags: approval-mozilla-beta?
Attachment #8833360 - Flags: approval-mozilla-beta+
Attachment #8833355 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Possibly related to this g-bug (same Android version 19=4.4.0-4.4.4, on intel devices):
https://code.google.com/p/android/issues/detail?id=174522

They claim to have added code to catch Exceptions in the palette library in support library 23.1.0 (we're on 23.4.0), so it could be completely unrelated.
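For context, a Java-level defensive wrapper (the kind of fix the g-bug describes being added inside the support library) would look roughly like this hypothetical sketch. Note that catching exceptions in Java cannot stop a native crash inside libdvm.so, which is consistent with the suspicion that the g-bug is unrelated. The IntSupplier indirection stands in for the palette call so the sketch runs on a plain JVM.

```java
import java.util.function.IntSupplier;

public class SafeColorExtraction {
    // Hypothetical fallback color used when extraction fails.
    static final int DEFAULT_COLOR = 0xFF363B40;

    // Defensive wrapper: if the extraction throws a Java exception,
    // return a default color instead of crashing. This does NOT help
    // against a native crash in the VM itself.
    static int extractColorSafely(IntSupplier extractor) {
        try {
            return extractor.getAsInt();
        } catch (RuntimeException e) {
            return DEFAULT_COLOR;
        }
    }

    public static void main(String[] args) {
        // Normal case: the extractor's value is passed through.
        System.out.println(extractColorSafely(() -> 0xFF112233) == 0xFF112233);
        // Failure case: the exception is swallowed, default returned.
        System.out.println(extractColorSafely(() -> {
            throw new IllegalStateException("extraction failed");
        }) == DEFAULT_COLOR);
    }
}
```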
(In reply to Gerry Chang [:gchang] from comment #13)
> Mark 51 won't fix as there is nothing actionable now.

Now that we have something actionable shouldn't this be changed to affected?
I literally mid-aired your comment trying to change it.
Added (probably by Gerry) to the release notes with
"Fix a top crash caused by Android library (Palette) on some x86 devices (Bug 1318667)"
I just tested the release APK on my Asus tablet and something weird is happening: the x86 APK is working and does not crash anymore. However, I can install the ARM APK too. It doesn't show our "wrong architecture" toast - it just starts normally. However, it then crashes with the same signature. It looks like the tablet can not only run ARM APKs in some compatibility mode; it also pretends to be an ARM device, so our checks do not work. I'll file a separate bug for that. The consequence for this bug is that we are still going to see this crash if the wrong APK is installed on a device (hopefully this does not happen via Google Play).
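A sketch of the kind of architecture check that gets defeated here (hypothetical; on API 19 the ABI list would come from Build.CPU_ABI/CPU_ABI2, later from Build.SUPPORTED_ABIS). If the device's ARM compatibility layer reports ARM ABIs, a check like this sees a "valid" ABI for the ARM APK and lets it start, matching the behavior observed above.

```java
import java.util.Arrays;

public class AbiCheckSketch {
    // Hypothetical startup check for an ARM build: does any reported ABI
    // look like ARM? A device with an ARM translation layer (like this
    // tablet) reports ARM ABIs even though its CPU is x86, so the check
    // passes, the mismatched APK starts, and it later crashes natively.
    static boolean deviceReportsArm(String[] reportedAbis) {
        return Arrays.stream(reportedAbis).anyMatch(a -> a.startsWith("armeabi"));
    }

    public static void main(String[] args) {
        // x86 tablet with a translation layer: the ARM check is fooled.
        System.out.println(deviceReportsArm(new String[]{"x86", "armeabi-v7a"}));
        // Plain x86 device without the layer: the check correctly fails.
        System.out.println(deviceReportsArm(new String[]{"x86"}));
    }
}
```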
See Also: → 1337318
We have verification of the fix from the duplicate.
Iteration: --- → 1.15
Priority: -- → P1
Whiteboard: [MobileAS]
See Also: → 1408691
Product: Firefox for Android → Firefox for Android Graveyard