Closed Bug 1211266 Opened 9 years ago Closed 9 years ago

Occasional missing icons after flashing homescreen

Categories

(Core :: IPC, defect, P1)

ARM
Gonk (Firefox OS)
defect

Tracking

()

VERIFIED FIXED
mozilla45
blocking-b2g 2.5+
Tracking Status
firefox45 --- fixed
b2g-master --- verified

People

(Reporter: kgrandon, Assigned: baku)

References

Details

Attachments

(19 files, 14 obsolete files)

3.96 KB, text/plain
Details
46 bytes, text/x-github-pull-request
cwiiis
: review+
Details | Review
162.49 KB, text/x-log
Details
3.71 KB, text/plain
Details
4.34 KB, text/plain
Details
180.01 KB, text/x-log
Details
2.61 KB, text/plain
Details
1011 bytes, text/x-sh
Details
261.77 KB, text/x-log
Details
8.45 KB, text/plain
Details
10.47 KB, patch
Details | Diff | Splinter Review
294.65 KB, text/x-log
Details
344.75 KB, text/x-log
Details
24.49 KB, text/plain
Details
15.82 KB, text/plain
Details
1.56 KB, patch
Details | Diff | Splinter Review
16.56 KB, patch
bent.mozilla
: review+
gerard-majax
: feedback+
Details | Diff | Splinter Review
57 bytes, text/x-github-pull-request
gmarty
: review+
Details | Review
46 bytes, text/x-github-pull-request
gmarty
: review+
Details | Review
This happens with the homescreen app and not verticalhome. I think this is a race condition with using the homescreen-webapps-manage permission and privileged icon APIs.

Noticed this on a flame.

This is possibly the same issue as: https://bugzilla.mozilla.org/show_bug.cgi?id=1042807#c22
I get the same behavior on b2gdroid.  These errors in logcat seem related:

0-03 17:02:51.893 I/Gecko   ( 5452): ###!!! [Parent][DispatchAsyncMessage] Error: (msgtype=0x18001D,name=PBackground::Msg_PBlobConstructor) Value error: message was deserialized, but contained an illegal value
10-03 17:02:51.893 I/Gecko   ( 5452): 
10-03 17:02:51.893 I/Gecko   ( 5452): 
10-03 17:02:51.893 I/Gecko   ( 5452): ###!!! [Parent][DispatchAsyncMessage] Error: (msgtype=0x1E0003,name=???) Route error: message sent to unknown actor ID
10-03 17:02:51.893 I/Gecko   ( 5452): 
10-03 17:02:51.893 I/Gecko   ( 5452): 
10-03 17:02:51.893 I/Gecko   ( 5452): ###!!! [Parent][DispatchAsyncMessage] Error: (msgtype=0x18001D,name=PBackground::Msg_PBlobConstructor) Value error: message was deserialized, but contained an illegal value
10-03 17:02:51.893 I/Gecko   ( 5452): 
10-03 17:02:51.893 I/Gecko   ( 5452): 
10-03 17:02:51.893 I/Gecko   ( 5452): ###!!! [Parent][DispatchAsyncMessage] Error: (msgtype=0x1E0003,name=???) Route error: message sent to unknown actor ID
10-03 17:02:51.893 I/Gecko   ( 5452): 
10-03 17:02:51.906 I/Gecko   ( 5452): 
10-03 17:02:51.906 I/Gecko   ( 5452): ###!!! [Parent][DispatchAsyncMessage] Error: (msgtype=0x18001D,name=PBackground::Msg_PBlobConstructor) Value error: message was deserialized, but contained an illegal value
10-03 17:02:51.906 I/Gecko   ( 5452): 
10-03 17:02:51.906 I/Gecko   ( 5452): 
10-03 17:02:51.906 I/Gecko   ( 5452): ###!!! [Parent][DispatchAsyncMessage] Error: (msgtype=0x1E0003,name=???) Route error: message sent to unknown actor ID
10-03 17:02:51.906 I/Gecko   ( 5452):
Whiteboard: [systemsfe]
Ted, can you help out here?
blocking-b2g: --- → 2.5+
Flags: needinfo?(tclancy)
I'll try to play with that. It's been something we have been spotting a couple of times already ... on several devices
Assignee: nobody → lissyx+mozillians
Flags: needinfo?(tclancy)
Kevin, can you provide STRs ?
Flags: needinfo?(kevingrandon)
(In reply to Alexandre LISSY :gerard-majax from comment #4)
> Kevin, can you provide STRs ?

The STR is: 
1) Flash device with make reset-gaia
2) Scroll down the entire app list until you see an empty space with no icon (the app name will be there).
3) If all icons are present, repeat from 1. 


There's ~50 icons installed in an engineering build, and I saw it happen to a single icon in the grid about ~50% of the time. The reproduction of this is only 1% or so per icon.
Flags: needinfo?(kevingrandon)
Ok, I did:
 0. make NOFTU=1 reset-gaia
 1. Settings, Homescreens, select "New Home Screen"

This changed my homescreen and I have three missing icons: Camera, Twitter and FM Radio. Logcat confirms what Fabrice said:
10-09 13:59:10.959  3595  3595 E GeckoConsole: [JavaScript Error: "ConstraintError" {file: "jar:file:///system/b2g/omni.ja!/components/DataStoreImpl.js" line: 233}]
10-09 13:59:11.029   212   212 V WLAN_PSA: NL MSG, len[048], NL type[0x11] WNI type[0x5050] len[028]
10-09 13:59:15.229  3595  3632 I Gecko   : 
10-09 13:59:15.229  3595  3632 I Gecko   : ###!!! [Parent][DispatchAsyncMessage] Error: (msgtype=0x18001D,name=PBackground::Msg_PBlobConstructor) Value error: message was deserialized, but contained an illegal value
10-09 13:59:15.229  3595  3632 I Gecko   : 
10-09 13:59:15.229  3595  3632 I Gecko   : 
10-09 13:59:15.229  3595  3632 I Gecko   : ###!!! [Parent][DispatchAsyncMessage] Error: (msgtype=0x1E0003,name=???) Route error: message sent to unknown actor ID
10-09 13:59:15.229  3595  3632 I Gecko   : 
10-09 13:59:15.759  3595  3632 I Gecko   : 
10-09 13:59:15.759  3595  3632 I Gecko   : ###!!! [Parent][DispatchAsyncMessage] Error: (msgtype=0x18001D,name=PBackground::Msg_PBlobConstructor) Value error: message was deserialized, but contained an illegal value
10-09 13:59:15.759  3595  3632 I Gecko   : 
10-09 13:59:15.759  3595  3632 I Gecko   : 
10-09 13:59:15.759  3595  3632 I Gecko   : ###!!! [Parent][DispatchAsyncMessage] Error: (msgtype=0x1E0003,name=???) Route error: message sent to unknown actor ID
10-09 13:59:15.759  3595  3632 I Gecko   : 
10-09 13:59:15.919  3595  3632 I Gecko   : 
10-09 13:59:15.919  3595  3632 I Gecko   : ###!!! [Parent][DispatchAsyncMessage] Error: (msgtype=0x18001D,name=PBackground::Msg_PBlobConstructor) Value error: message was deserialized, but contained an illegal value
10-09 13:59:15.919  3595  3632 I Gecko   : 
10-09 13:59:15.919  3595  3632 I Gecko   : 
10-09 13:59:15.919  3595  3632 I Gecko   : ###!!! [Parent][DispatchAsyncMessage] Error: (msgtype=0x1E0003,name=???) Route error: message sent to unknown actor ID
10-09 13:59:15.919  3595  3632 I Gecko   : 
10-09 13:59:16.039   212   212 V WLAN_PSA: NL MSG, len[048], NL type[0x11] WNI type[0x5050] len[028]
10-09 13:59:16.859  4289  4289 E New Home Screen: Content JS ERROR: Failed to retrieve icon [object DOMError] 
10-09 13:59:16.859  4289  4289 E New Home Screen:     at proto.refresh/< (app://homescreen.gaiamobile.org/bower_components/gaia-app-icon/script.js:552:13)
10-09 13:59:17.659  4477  4520 I (Preallocated app): PAC file installed from data: URI
10-09 13:59:17.829  3595  3632 I Gecko   : 
10-09 13:59:17.829  3595  3632 I Gecko   : ###!!! [Parent][DispatchAsyncMessage] Error: (msgtype=0x18001D,name=PBackground::Msg_PBlobConstructor) Value error: message was deserialized, but contained an illegal value
10-09 13:59:17.829  3595  3632 I Gecko   : 
10-09 13:59:17.829  3595  3632 I Gecko   : 
10-09 13:59:17.829  3595  3632 I Gecko   : ###!!! [Parent][DispatchAsyncMessage] Error: (msgtype=0x1E0003,name=???) Route error: message sent to unknown actor ID
DOMError is unrelated, it is always reproduced and concerns mochi.test app.
Priority: -- → P1
cc-ing Sam as he's been looking at a similar blob ipc issue when using device storage.
Unable to reproduce at all with a debug Gecko build on a Flame :(
(In reply to Alexandre LISSY :gerard-majax from comment #9)
> Unable to reproduce at all with a debug Gecko build on a Flame :(

Are you able to reproduce with a normal build on a flame?
(In reply to Kevin Grandon :kgrandon from comment #10)
> (In reply to Alexandre LISSY :gerard-majax from comment #9)
> > Unable to reproduce at all with a debug Gecko build on a Flame :(
> 
> Are you able to reproduce with a normal build on a flame?

Yes. And on a Z3c non debug with lots of applications. No need to reset-gaia, on that profile, just switching homescreen is enough. And even after switching once, I still got occurrences of PBlob errors ...
And after a reboot there is no trace of the issue. Unable to reproduce also when installing Gecko as an update.
Attached file gdb stacktrace
So I have been able to reproduce that on my first try with Z3c, Gecko built in debug and following the STRs of reset-gaia then switch to new homescreen.
Flags: needinfo?(Jan.Varga)
> (gdb) print idTableEntry
> $1 = {mRawPtr = 0xb6f65050 <_pr_envLock>}
That assert comes from bug 1076975.
Attached file homescreen_icons_0.log (obsolete) —
logcat while reproducing
Component: Gaia::Homescreen → DOM: IndexedDB
Product: Firefox OS → Core
Component: DOM: IndexedDB → IPC
Duplicates this (https://bugzilla.mozilla.org/show_bug.cgi?id=1209545) but this one is probably better for tracking it I think.
So it would look like we are calling |IDTableEntry::GetOrCreateInternal()| https://dxr.mozilla.org/mozilla-central/rev/11ff0ccb7d59311df4c190d331c8b58c6e35a0c8/dom/ipc/Blob.cpp#4505 from https://dxr.mozilla.org/mozilla-central/rev/11ff0ccb7d59311df4c190d331c8b58c6e35a0c8/dom/ipc/Blob.cpp#1412

Since the MOZ_ASSERT() is https://dxr.mozilla.org/mozilla-central/rev/11ff0ccb7d59311df4c190d331c8b58c6e35a0c8/dom/ipc/Blob.cpp#3991 it means what we get from |IDTableEntry::Create| is wrong.

With gdb I could check:
> (gdb) print id
> $3 = {m0 = 0, m1 = 61615, m2 = 45716, m3 = "\000\000\000\000\000\000\000"}

I don't know if that value is a correct one though.

There are four ways to return null from that code:
 (1) https://dxr.mozilla.org/mozilla-central/rev/11ff0ccb7d59311df4c190d331c8b58c6e35a0c8/dom/ipc/Blob.cpp#4523
 (2) https://dxr.mozilla.org/mozilla-central/rev/11ff0ccb7d59311df4c190d331c8b58c6e35a0c8/dom/ipc/Blob.cpp#4535
 (3) https://dxr.mozilla.org/mozilla-central/rev/11ff0ccb7d59311df4c190d331c8b58c6e35a0c8/dom/ipc/Blob.cpp#4539
 (4) https://dxr.mozilla.org/mozilla-central/rev/11ff0ccb7d59311df4c190d331c8b58c6e35a0c8/dom/ipc/Blob.cpp#4543

We know we pass:
> /* aMayCreate */       true,
> /* aMayGet */          false,
> /* aIgnoreProcessID */ false

The logcat shows:
> 10-13 12:17:02.091  1525  1556 I Gecko   : [Parent 1525] WARNING: '!aMayCreate', file ../../../gecko/dom/ipc/Blob.cpp, line 4542
> 10-13 12:17:02.091  1525  1556 I Gecko   : [Parent 1525] WARNING: '!idTableEntry', file ../../../gecko/dom/ipc/Blob.cpp, line 3965
> 10-13 12:17:02.092  1525  1556 F MOZ_Assert: Assertion failure: false (), at ../../../gecko/dom/ipc/Blob.cpp:3966

So we are reaching condition (4) and issue the warning. And thus we fail at https://dxr.mozilla.org/mozilla-central/rev/11ff0ccb7d59311df4c190d331c8b58c6e35a0c8/dom/ipc/Blob.cpp#3965 |NS_WARN_IF(!idTableEntry)|.

All those are in the |AnyBlobConstructorParams::TKnownBlobConstructorParams| case. Yet, we get a stack that is about assertion failure at https://dxr.mozilla.org/mozilla-central/rev/11ff0ccb7d59311df4c190d331c8b58c6e35a0c8/dom/ipc/Blob.cpp#3991 in |AnyBlobConstructorParams::TSameProcessBlobConstructorParams|
Fun fact: it looks like delaying the publication of "displayed" event in Gaia when switching homescreen makes the bug disappearing ...
(In reply to Alexandre LISSY :gerard-majax from comment #20)
> Fun fact: it looks like delaying the publication of "displayed" event in
> Gaia when switching homescreen makes the bug disappearing ...

Hey Alex, since this bug is assigned to you, are you working on the fix for this? Or does this bug need to be addressed by the platform team? :)

Thanks!
(In reply to Alexandre LISSY :gerard-majax from comment #20)
> Fun fact: it looks like delaying the publication of "displayed" event in
> Gaia when switching homescreen makes the bug disappearing ...

It's not "displayed" but "changed".
(In reply to Candice Serran (:cserran) from comment #21)
> (In reply to Alexandre LISSY :gerard-majax from comment #20)
> > Fun fact: it looks like delaying the publication of "displayed" event in
> > Gaia when switching homescreen makes the bug disappearing ...
> 
> Hey Alex, since this bug is assigned to you, are you working on the fix for
> this? Or does this bug need to be addressed by the platform team? :)
> 
> Thanks!

I fear that is something platform will have to work on. So far everything I have been able to hack would point to some kind of race condition around Blob passing within process. That is triggered when we are changing the homescreen because at that moment we will kill the previous homescreen, instantiate the new one and then make it visible.

We do notify to start displaying it by publishing that "changed" event. By doing so, the homescreen code will start to fetch apps icons. Then it seems we have something bad happening at that time but as exposed in comment 19 it's unclear exactly where it is failing.
I'm going to take a look at the issue.
Flags: needinfo?(Jan.Varga)
Attached file homescreen_entry_get_put.log (obsolete) —
Checking the IDTableEntry Get() and Put() calls. One thing that is interesting in that is the latest ID we manipulate before crashing is "61de7b13-54c2-4a44-8eba-4e59f6d30e95".

Now, just before we crash, we see:
> 10-16 16:40:12.562 18167 18198 I Gecko   : AllocPBlobParent:209 => AllocPBlobParent
> 10-16 16:40:12.562 18167 18198 I Gecko   : CreateFromParams:3960 => TKnownBlobConstructorParams: IN
> 10-16 16:40:12.562 18167 18198 I Gecko   : IDTableEntry::GetOrCreateInternal: aID={61de7b13-54c2-4a44-8eba-4e59f6d30e95} ->Get()
> 10-16 16:40:12.562 18167 18198 I Gecko   : IDTableEntry::GetOrCreateInternal: aID={61de7b13-54c2-4a44-8eba-4e59f6d30e95} ->Get() => entry NOK
> 10-16 16:40:12.562 18167 18198 I Gecko   : [Parent 18167] WARNING: '!aMayCreate', file ../../../gecko/dom/ipc/Blob.cpp, line 4551
> 10-16 16:40:12.562 18167 18198 I Gecko   : [Parent 18167] WARNING: '!idTableEntry', file ../../../gecko/dom/ipc/Blob.cpp, line 3966
> 10-16 16:40:12.562 18167 18198 F MOZ_Assert: Assertion failure: false (), at ../../../gecko/dom/ipc/Blob.cpp:3967

So we do not find an entry in IDTableEntry::GetOrCreateInternal, hence we are in https://dxr.mozilla.org/mozilla-central/rev/2387ada864282880d3a498d643abe3d8b887ee59/dom/ipc/Blob.cpp#4541

Yet, if you look earlier in that log for the same ID value, we get something different:
> 10-16 16:40:08.641 18167 18167 I Gecko   : IDTableEntry::GetOrCreateInternal: aID={61de7b13-54c2-4a44-8eba-4e59f6d30e95} ->Get()
> 10-16 16:40:08.641 18167 18167 I Gecko   : IDTableEntry::GetOrCreateInternal: aID={61de7b13-54c2-4a44-8eba-4e59f6d30e95} ->Get() => entry NOK
> 10-16 16:40:08.641 18167 18167 I Gecko   : IDTableEntry::GetOrCreateInternal: aID={61de7b13-54c2-4a44-8eba-4e59f6d30e95} ->Put()

On that call, we do have no entry, but we are able to reach the end of the method and to call Put() to populate the table.
Attached file homescreen_entry_get_put_3_SUCCESS.log (obsolete) —
Comment on attachment 8674985 [details] [review]
[gaia] Cwiiis:bug1211266-work-around-get-icon > mozilla-b2g:master

Got r=gmarty earlier, was leaving until the last minute in case the platform bug got fixed before now.
Attachment #8674985 - Flags: review+
Attached file homescreen_entry_get_put_4_FAIL.log (obsolete) —
Can you have a look at comment 26 and attachment 8674919 [details] and attachment 8674998 [details] ? It seems that the sIDTable behavior is the a good root to the problem but I still suspect that this is just a symptom. How can we debug further?
Flags: needinfo?(amarchesini)
Flags: needinfo?(Jan.Varga)
I've merged the work-around: https://github.com/mozilla-b2g/gaia/commit/4d592d4d385c85296511216e7c702c71bf2c9540

Not sure if we want to track the platform bug in this bug or not - I'll leave that up to others.
Attached file homescreen_entry_get_put_7.log (obsolete) —
Crash with Blob 5ee51033-aac4-4576-a673-d33884bfb6e1
Attached patch blob-ipc-homescreen.patch (obsolete) — Splinter Review
Debugging patch
This is the stack we get if we break when we are in the ::Get() call in the unexpected state: that matches the NS_WARN_IF(!aMayCreate).
Attached file homescreen_get_put_assert.log (obsolete) —
Log matching the attachment 8675309 [details]: gdb stacktrace with MOZ_ASSERT(false) added
(In reply to Alexandre LISSY :gerard-majax from comment #36)
> Created attachment 8675310 [details]
> homescreen_get_put_assert.log
> 
> Log matching the attachment 8675309 [details]: gdb stacktrace with
> MOZ_ASSERT(false) added

From that we know that ID "efa6019f-ccab-4b86-966c-26bbe2819e08" did propery got inserted with the Put() call and a subsequent Get() call did succeed. Now when we hit the failure, we do see that |entry| is a null:
> (gdb) f
> #0  0xb3794da8 in mozilla::dom::BlobParent::IDTableEntry::GetOrCreateInternal (aID=..., aProcessID=-1619102152, aBlobImpl=aBlobImpl@entry=0x0, aMayCreate=aMayCreate@entry=false, aMayGet=aMayGet@entry=true, 
    aIgnoreProcessID=aIgnoreProcessID@entry=false) at ../../../gecko/dom/ipc/Blob.cpp:4558
> 4558	        MOZ_ASSERT(false);
> (gdb) print entry
> $1 = {mRawPtr = 0x0}
Attached file gdb sIDTable (obsolete) —
That log exposes the crash. And we can see the Blob ID "842b74dd-7a7a-431c-87aa-2e927f66af3f". That blob gets removed from the hashmap just before we try to do a Get(), because of IDTableEntry destructor being called.
Attached file stacktrace of Remove()
Attached file addr2line
That is the stack addr2line result of attachment 8675643 [details]
(In reply to Alexandre LISSY :gerard-majax from comment #41)
> Created attachment 8675644 [details]
> addr2line
> 
> That is the stack addr2line result of attachment 8675643 [details]

So we remove the entry after receiving an IPC "PBlob::Msg___delete__" message.
More debugging, to try to identify if the IPC |Send::__delete__| call could come from |NoteDyingRemoteBlobImpl| but so far it's not very completely useful.
Attachment #8672256 - Attachment is obsolete: true
Attachment #8674900 - Attachment is obsolete: true
Attachment #8674919 - Attachment is obsolete: true
Attachment #8674998 - Attachment is obsolete: true
Attachment #8675276 - Attachment is obsolete: true
Attachment #8675309 - Attachment is obsolete: true
Attachment #8675310 - Attachment is obsolete: true
Attachment #8675318 - Attachment is obsolete: true
Attached patch blob-ipc-homescreen.patch (obsolete) — Splinter Review
Attachment #8675277 - Attachment is obsolete: true
Hacked https://dxr.mozilla.org/mozilla-central/rev/d1a89632277fbaaf470c90a35573776048988f2d/dom/ipc/Blob.cpp#4111 and commented the |PBlobParent::Send__delete__()| statement, I am still hitting the issue. That answers comment 43: we are not receiving the bogus delete call from that call site but from somewhere else.
I have troubles explaining that set of calls that are part of attachment 8675644 [details]:
> mozilla::dom::PContentParent::RemoveManagee(int, mozilla::ipc::IProtocol*)
> /home/alex/codaz/Mozilla/b2g/devices/XperiaZ3c/B2G.L/objdir-gecko-dbg/ipc/ipdl/PContentParent.cpp:2579

> mozilla::dom::PBlobParent::OnMessageReceived(IPC::Message const&)
> /home/alex/codaz/Mozilla/b2g/devices/XperiaZ3c/B2G.L/objdir-gecko-dbg/ipc/ipdl/PBlobParent.cpp:252 (discriminator 1)

> mozilla::dom::PContentParent::OnMessageReceived(IPC::Message const&)
> /home/alex/codaz/Mozilla/b2g/devices/XperiaZ3c/B2G.L/objdir-gecko-dbg/ipc/ipdl/PContentParent.cpp:3366

The statement at |PContentParent.cpp:2579| is |DeallocPPresentationParent(actor);|, being part of |PPresentationMsgStart| case in |PContentParent::RemoveManagee|.
The statement at |PBlobParent.cpp:252|     is |return MsgProcessed;|, being part of |PBlob::Msg___delete____ID| case.
The statement at |PContentParent.cpp:3366| is |(routed__)->OnMessageReceived(msg__);|

So it looks like we reach the expected |PContentParent::RemoveManagee| method, but I cannot explain why we enter the |PPresentationMsgStart| case while we are manipulating blobs.
Here we can see that blob d756c688-b7a8-402c-91a1-d07be96798eb gets created by thread 13711 of process 13711. The same thread issues a Remove() operation. The subsequent CreateFromParams() and Get() calls that fails are in issued by thread 13742 from process 13711.
Log shows the correct PBlobMsgStart protocol id being used, but the stacktrace stil matches the RemoveManagee() call site of PPresentationMsgStart, so I'm a little bit lost. I would suspect that there's something wrong when collecting the stack itself and that we do indeed really trigger the correct path of PBlobMsgStart and its RemoveManagee() call. Said otherwise, we receive a legit delalloc request for the blob. Still have to track where this is coming from.
Attachment #8675667 - Attachment is obsolete: true
This blocks replaceable homescreens, but not new homescreen (which is shipping a work-around currently).
Blocks: homescreen-apps
No longer blocks: new-homescreen
More logging, we see PBlob delete issued:

> 10-22 11:29:35.039 29908 29908 I GeckoIPC: [time:1445506175038996][29908->29441][PBlobChild] Sending Msg___delete__([TODO])
> 10-22 11:29:35.040 29908 29908 I GeckoIPC: [time:1445506175040093][29908->29441][PBlobChild] Sending Msg_PBlobStreamConstructor([TODO])
> 10-22 11:29:35.040 29441 29472 I GeckoIPC: [time:1445506175040311][29441<-29908][PBlobParent] Received Msg_PBlobStreamConstructor([TODO])
> 10-22 11:29:35.042 29908 29908 I GeckoIPC: [time:1445506175042367][29908->29441][PBlobChild] Sending Msg___delete__([TODO])
> 10-22 11:29:35.043 29908 29908 I GeckoIPC: [time:1445506175043300][29908->29441][PBlobChild] Sending Msg_PBlobStreamConstructor([TODO])
> 10-22 11:29:35.045 29441 29441 I GeckoIPC: [time:1445506175045524][29441<-29908][PBlobParent] Received Msg___delete__([TODO])
> 10-22 11:29:35.045 29441 29441 I Gecko   : RemoveManagee: aProtocolId:30
> 10-22 11:29:35.045 29441 29441 I Gecko   : RemoveManagee: aProtocolId:30 PBlobMsgStart
> 10-22 11:29:35.045 29441 29441 I GeckoIPC: [time:1445506175045645][29441<-29908][PBlobParent] Received Msg___delete__([TODO])
> 10-22 11:29:35.045 29441 29441 I Gecko   : RemoveManagee: aProtocolId:30
> 10-22 11:29:35.045 29441 29441 I Gecko   : RemoveManagee: aProtocolId:30 PBlobMsgStart
That contains stack from PBlobChild::Send__delete__ call sites.
Excerpt of the end of attachment 8677364 [details] before the crash
Attached file temp5.txt
New stack after hacking StructuredCloneData() for Move constructor
Good news/bad news, it's all so reproduced on B2G Desktop/Mulet build (debug or not)
(In reply to Alexandre LISSY :gerard-majax from comment #58)
> Good news/bad news, it's all so reproduced on B2G Desktop/Mulet build (debug
> or not)

I'm going to try that too.
Flags: needinfo?(Jan.Varga)
I'm still struggling. Tried collecting nsHttp logs yesterday with Mulet build, in case the problems lies with the use of XHR (which produces the blob we send), but each time I enabled NSPR_LOG_MOULES this was not reproduced.

I still have some unanswered questions regarding attachment 8678120 [details]: it looks like we are releasing RemoteBlobImpl because of the StructuredCloneData destructor being called, and that comes from Release() on nsRefPtr<nsFrameMessageManager>. So: 
 - does that fact holds and matches the expected code path?

If so, then I guess it means the lack of reference is either against:
 - BlobChild::RemoteBlobImpl
 - StructuredCloneData

I'm out of ideas of where to check those two ideas :/
(In reply to Jan Varga [:janv] from comment #59)
> (In reply to Alexandre LISSY :gerard-majax from comment #58)
> > Good news/bad news, it's all so reproduced on B2G Desktop/Mulet build (debug
> > or not)
> 
> I'm going to try that too.

Jan, any update here?
Flags: needinfo?(Jan.Varga)
> Jan, any update here?

Yes. It's a regression introduced by StructuredCloneHolder. The problem is that when we retrieve data from a cloned buffer (child process), we create a new PBackground actor for any cloned BlobImpl. This is something that it was not done previously (wrongly), but doing this, when this PBackground Actor is released, it deletes the BlobParent on the parent process, and from here, the crash.

Finally we know where the bug is. Now we have to fix it :)
Flags: needinfo?(amarchesini)
Flags: needinfo?(Jan.Varga)
I guess it's time to switch assignee :)
Assignee: lissyx+mozillians → amarchesini
Whiteboard: [systemsfe]
Attached patch patch (obsolete) — Splinter Review
What I suspect it's happening here is that, racy, the constructor of the PBackground BlobParent is faster than then PContent one. This case it's the first one that has to keep the second one alive. khuey can you review this patch?

BTW, I also would like to have the move ctor patch landed.
Attachment #8680669 - Flags: review?(khuey)
Attachment #8680669 - Flags: feedback?(lissyx+mozillians)
Comment on attachment 8680669 [details] [diff] [review]
patch

Review of attachment 8680669 [details] [diff] [review]:
-----------------------------------------------------------------

::: b2g/app/b2g.js
@@ +299,5 @@
>  // Layers Acceleration.  We can only have nice things on gonk, because
>  // they're not maintained anywhere else.
>  pref("layers.offmainthreadcomposition.enabled", true);
>  #ifndef MOZ_WIDGET_GONK
> +pref("dom.ipc.tabs.disabled", false);

Ignore this. This is just required to reproduce the issue.
Comment on attachment 8680669 [details] [diff] [review]
patch

More than 15 consecutive runs successfull, even with the workaround removed from homescreen.

As soon as I remove that patch, issue gets reproduced very easily again. Looks like you found the fix!
Attachment #8680669 - Flags: feedback?(lissyx+mozillians) → feedback+
Great - here's a PR for removing the work-around from gaia-site-icon: https://github.com/gaia-components/gaia-site-icon/pull/17

And n?me to test/merge that and then open a PR for Gaia once this is fixed.
Flags: needinfo?(chrislord.net)
Attached patch crash.patch (obsolete) — Splinter Review
Attachment #8680669 - Attachment is obsolete: true
Attachment #8680669 - Flags: review?(khuey)
Attachment #8681199 - Flags: review?(khuey)
Comment on attachment 8681199 [details] [diff] [review]
crash.patch

Review of attachment 8681199 [details] [diff] [review]:
-----------------------------------------------------------------

::: b2g/app/b2g.js
@@ +299,5 @@
>  // Layers Acceleration.  We can only have nice things on gonk, because
>  // they're not maintained anywhere else.
>  pref("layers.offmainthreadcomposition.enabled", true);
>  #ifndef MOZ_WIDGET_GONK
> +pref("dom.ipc.tabs.disabled", false);

uh, what is this about?

::: dom/base/StructuredCloneHolder.cpp
@@ +562,5 @@
> +
> +  // This is needed just for workers thread.
> +  if (NS_IsMainThread()) {
> +    return blobImpl.forget();
> +  }

Why?  Isn't this called when somebody takes a PContent blob and puts it in an IndexedDB?
Bent, if you have time, can you tell me something more about EnsureBlobForBackgroundManager?
In your code, this was used only in workers when postMessage() was called. In my unification of the StructureClone use, I use this everywhere, and in general it works fine.

The issue we have here is a race condition created by the 2 BlobChild actors (1 from PContent and 1 from PBackground).
One of these 2 is released, and doing that, it calls __delete__. The BlobParent deletes the IDTableEntry.
But when the second BlobChild, finally, creates the BlobParent for the same RemoteBlob, this parent doesn't have a valid IDTableEntry anymore.

With my patch I don't called EnsureBlobForBackgroundManager when we are not in a worker thread. This works, but it's an hack :)
Flags: needinfo?(bent.mozilla)
Hm. I don't remember this code too well but it sounds like there's a refcount problem on the parent side? If you have two BlobChild actors that are referencing the same blob then when the first is released it shouldn't trigger the deletion of the IDTableEntry right?
Flags: needinfo?(bent.mozilla)
Attached patch patch (obsolete) — Splinter Review
Ok, this is the _real_ fix for this bug. And here a quick description of what the bug is about:

1. When we transfer a Blob from parent to process, at some point, we create a BlobChild.
2. in StructuredCloneHolder we find out that this is a nsIRemoteBlob and we create a BlobChild for PBackground as well.
3. a SendPBlobConstructor is sent to the parent for the creation of a BlobParent in PBackground.
3. nsIRemoteBlob of the first BlobChild (created from nsIContent*) gets released because nothing is keeping it alive.
4. race condition: the Send__delete_ of BlobChild (nsIContent) is received before the CTOR for the BlobParent in PBackground.
5. we remove the entry of IDTableEntry.
6. When the BlobParent for PBackground is created, it finds the ID of the other BlobParent, but that has been deleted in point 4. IDTableEntry doesn't exist any more and we crash.

What this patch does is to keep alive the remoteBlob when we create the PBackground BlobParent from another BlobChild (nsIContent).
Attachment #8681199 - Attachment is obsolete: true
Attachment #8681199 - Flags: review?(khuey)
Attachment #8683264 - Flags: review?(khuey)
Maybe bent, if you have time, you can give me a feedback for this patch.
Flags: needinfo?(bent.mozilla)
(In reply to Andrea Marchesini (:baku) from comment #73)
> Maybe bent, if you have time, you can give me a feedback for this patch.

Sure, I'll look over the weekend.
Comment on attachment 8683264 [details] [diff] [review]
patch

Review of attachment 8683264 [details] [diff] [review]:
-----------------------------------------------------------------

I don't think this is what you really want to do. This holds the BlobImpl alive on the child side forever, right? The whole idea underlying this blob stuff is that when the BlobImpl refcount drops to 0 we destroy the actor. With this patch in some cases that will never happen.

I think what you probably want to do is AddRef the BlobImpl while your PBackground blob is being constructed. Then when it is you can drop the ref. This is basically the same problem we solved with WaitForSliceCreation (http://mxr.mozilla.org/mozilla-central/source/dom/ipc/Blob.cpp#3373) only for the main thread. We do that synchronously, but you could make an async version of that for the main thread:

  protocol PBlob {
    // ...
  child:
    Created();
  };

Then just send that message once the actor is created on PBackground in the parent process, and you can drop the extra ref there.
Attachment #8683264 - Flags: feedback-
Attached patch crash.patch (obsolete) — Splinter Review
bent, I hope you can find time for this new patch!
This patch is exactly the same of before, but in addition it adds a "|child: CreatedFromKnownBlob()|". When the PBlobParent is created, we send a notification to the child in case that blob is built from a known blob.

When the child receives that call, it frees the mDifferentProcessRemoteBlob.
Attachment #8683264 - Attachment is obsolete: true
Attachment #8683264 - Flags: review?(khuey)
Attachment #8688561 - Flags: review?(bent.mozilla)
Attached patch crash.patchSplinter Review
Here a new version. I changed 1 like to improve the check of the kind of blob constructor param we have.
Attachment #8688561 - Attachment is obsolete: true
Attachment #8688561 - Flags: review?(bent.mozilla)
Attachment #8688962 - Flags: review?(bent.mozilla)
Andrew, how do we want to proceed with this bug? I'm sure bent will review this patch soon or late, but how urgent is this bug?
Flags: needinfo?(overholt)
Gregor, wdyt?
Flags: needinfo?(overholt) → needinfo?(anygregor)
(In reply to Andrew Overholt [:overholt] from comment #79)
> Gregor, wdyt?

It's important that we fix it but we are not in a huge hurry. It should be fine if we can fix it in the next 1-2 weeks.
Flags: needinfo?(anygregor)
Attachment #8688962 - Flags: feedback?(lissyx+mozillians)
I'm rebuilding uptodate without the workaround from Cwiiis and I will be retesting the latest patch.
Comment on attachment 8688962 [details] [diff] [review]
crash.patch

So I have performed some testing using:
> for i in $(seq 60); do (echo "Making gaia ... "; make PRODUCTION=0 MOZILLA_OFFICIAL=0 NO_LOCK_SCREEN=1 NOFTU=1 reset-gaia 2>/dev/null 1>/dev/null; echo "Gaia booting #${i}, waiting 60 secs now ..."; sleep 60) | ts "%b %d %H:%M:%S |>"; done;

Over 42 runs with the fix, there was no IPC blob error in the logs.
Over 16 runs without the fix, there was 4 IPC blob errors in the logs. I am continuing the testing to assert the frequency, but so far it would confirm the fix is good.
After 42 retries without the fix, I got 10 IPC blob errors ...
Comment on attachment 8688962 [details] [diff] [review]
crash.patch

I could not reproduce the issue with that. Without, I get a IPC error in a reliable and reproductible way.
Attachment #8688962 - Flags: feedback?(lissyx+mozillians) → feedback+
Comment on attachment 8688962 [details] [diff] [review]
crash.patch

Review of attachment 8688962 [details] [diff] [review]:
-----------------------------------------------------------------

I think this is probably fine. Just make sure you test with artificial slowdowns on PBackground to be sure (e.g. play around with http://mxr.mozilla.org/mozilla-central/source/dom/indexedDB/ActorsParent.cpp#245).

::: dom/base/StructuredCloneHolder.cpp
@@ +562,5 @@
>  
>    if (!aManager) {
>      aManager = BackgroundChild::GetForCurrentThread();
> +    if (!aManager) {
> +      return blobImpl.forget();

It seems really weird to call a function named 'EnsureBlobForBackgroundManager' and expect it to do nothing if there's no PBackgroundChild available... Why make this change?

@@ +670,5 @@
>  
>    RefPtr<BlobImpl> blobImpl = EnsureBlobForBackgroundManager(aBlob->Impl());
>    MOZ_ASSERT(blobImpl);
>  
> +  if (NS_WARN_IF(NS_FAILED(blobImpl->SetMutable(false)))) {

This can be MOZ_ALWAYS_TRUE() right?

::: dom/ipc/Blob.cpp
@@ +1752,5 @@
>  
>    BlobChild* mActor;
>    nsCOMPtr<nsIEventTarget> mActorTarget;
>  
> +  RefPtr<BlobImpl> mDifferentProcessBlobImpl;

This needs a big comment explaining why this exists, how long it lives, etc., because this kinda goes against the whole design.
Attachment #8688962 - Flags: review?(bent.mozilla) → review+
Flags: needinfo?(bent.mozilla)
(In reply to Pulsebot from comment #86)
> https://hg.mozilla.org/integration/mozilla-inbound/rev/385c7069dca8

This was a patch with the wrong bug number and has since been backed out. Please disregard. :)
Wow, thanks bent.

> It seems really weird to call a function named
> 'EnsureBlobForBackgroundManager' and expect it to do nothing if there's no
> PBackgroundChild available... Why make this change?

This EnsureBlobForbackgroundManager can be called on the main-thread as well, but nothing guaranties that PBackground is set up there yet. Soon PBackground will be initialized by default on main-thread and this check will go away.
The bad news is that the patch has been backed out. But the good news is that this patch triggers bug 1226477 more frequently.
Flags: needinfo?(amarchesini)
Backed out for static analysis bustage on Linux and OS X (Windows has infrastructure issues at the moment):
Backout: https://hg.mozilla.org/integration/mozilla-inbound/rev/3cec4367c10b

Bustage: https://treeherder.mozilla.org/logviewer.html#?job_id=17833382&repo=mozilla-inbound

/builds/slave/m-in-m64-st-an-d-0000000000000/build/src/dom/ipc/Blob.cpp:3044:3: error: 'AddRef' cannot be called on the return value of 'operator->'
make[5]: *** [Blob.o] Error 1
make[5]: *** Waiting for unfinished jobs....
make[4]: *** [dom/ipc/target] Error 2
make[4]: *** Waiting for unfinished jobs....
make[3]: *** [compile] Error 2
make[2]: *** [default] Error 2
make[1]: *** [realbuild] Error 2
make: *** [build] Error 2
Return code: 2
Flags: needinfo?(amarchesini)
https://hg.mozilla.org/mozilla-central/rev/c33072613b5e
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla45
All icons were visible when switching between homescreens on Flame 2.6 and Aries 2.6.

Environmental Variables:
Device: Aries 2.6 [Full Flash]
BuildID: 20151201122021
Gaia: 59c8605876736b22acaaed25be00008e452149cb
Gecko: 89ae375e24d7560f613b75a585e9ed4238604494
Gonk: a19052e4389c3ae2d8fc3e7a74a475401baacc56
Version: 45.0a1 (2.6) 
Firmware Version: D5803_23.1.A.1.28_NCB.ftf
User Agent: Mozilla/5.0 (Mobile; rv:45.0) Gecko/45.0 Firefox/45.0

Device: FlameKK 2.6 [Full Flash][512mb]
BuildID: 20151201030222
Gaia: 59c8605876736b22acaaed25be00008e452149cb
Gecko: 66a6d7ec9534b9d7847b665142fef0dd87623768
Gonk: 205ac4204bbbb2098a8046444acba551ba5dc75a
Version: 45.0a1 (2.6) 
Firmware Version: v18D
User Agent: Mozilla/5.0 (Mobile; rv:45.0) Gecko/45.0 Firefox/45.0

Result:
All icons are visible.
Status: RESOLVED → VERIFIED
Flags: needinfo?(jmercado)
QA Whiteboard: [QAnalyst-Triage+]
Flags: needinfo?(jmercado)
Attached file Remove work-around
This removes the work-around from the component, I'll add a PR for gaia merge/uplift once this is merged.
Flags: needinfo?(chrislord.net)
Attachment #8694847 - Flags: review?(gmarty)
Comment on attachment 8694847 [details] [review]
Remove work-around

r+. Glad we finally removed that hack.
Attachment #8694847 - Flags: review?(gmarty) → review+
Flags: needinfo?(amarchesini)
Comment on attachment 8703593 [details] [review]
[gaia] Cwiiis:bug1211266-remove-workaround > mozilla-b2g:master

Update gaia-site-icon to remove bug work-around.
Attachment #8703593 - Flags: review?(gmarty)
Comment on attachment 8703593 [details] [review]
[gaia] Cwiiis:bug1211266-remove-workaround > mozilla-b2g:master

Looks good!
Attachment #8703593 - Flags: review?(gmarty) → review+
Did the platform fix get uplifted to b2g-44?
Flags: needinfo?(amarchesini)
Comment on attachment 8688962 [details] [diff] [review]
crash.patch

NOTE: Please see https://wiki.mozilla.org/Release_Management/B2G_Landing to better understand the B2G approval process and landings.

[Approval Request Comment]
Bug caused by (feature/regressing bug #): Cloning blobs between threads
User impact if declined: a crash
Testing completed: yes, but there is not a mochitest because this is a race condition
Risk to taking this patch (and alternatives if risky): no big risk. This patch has been tested for quite a while.
String or UUID changes made by this patch: none
Flags: needinfo?(amarchesini)
Attachment #8688962 - Flags: approval‑mozilla‑b2g44?
Comment on attachment 8703593 [details] [review]
[gaia] Cwiiis:bug1211266-remove-workaround > mozilla-b2g:master

This is actually a b2g-2.5 uplift request, but you can't do that on a Gecko bug. Assuming that the platform bug is uplifted (as it should, given this has long been nominated a 2.5 blocker), this should also be uplifted to Gaia.

NOTE: Please see https://wiki.mozilla.org/Release_Management/B2G_Landing to better understand the B2G approval process and landings.

[Approval Request Comment]
Bug caused by (feature/regressing bug #): The work-around for this bug can itself, rarely, cause missing icons, due to a small logic error.
User impact if declined: Occasionally missing bookmark icons.
Testing completed: On master, manually tested, this just removes a work-around in Gaia that is no longer necessary when the platform bug is fixed.
Risk to taking this patch (and alternatives if risky): Low
String or UUID changes made by this patch: None
Attachment #8703593 - Flags: approval‑mozilla‑b2g44?
Attachment #8688962 - Flags: approval‑mozilla‑b2g44?
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: