Closed
Bug 1319278
Opened 8 years ago
Closed 7 years ago
Crash in OOM | large | mozalloc_abort | mozalloc_handle_oom | moz_xrealloc | nsTArray_base<T>::EnsureCapacity<T> | mozilla::dom::workers::WorkerPrivateParent<T>::DispatchControlRunnable
Categories
(Core :: DOM: Workers, defect)
Tracking
()
RESOLVED
FIXED
mozilla54
People
(Reporter: erahm, Assigned: bkelly)
References
(Depends on 1 open bug)
Details
(Keywords: crash, Whiteboard: [MemShrink:P2])
Crash Data
Attachments
(5 files, 9 obsolete files)
2.28 KB,
patch
|
baku
:
review+
|
Details | Diff | Splinter Review |
7.88 KB,
patch
|
baku
:
review+
|
Details | Diff | Splinter Review |
4.42 KB,
patch
|
bkelly
:
review+
|
Details | Diff | Splinter Review |
3.02 KB,
patch
|
bkelly
:
review+
|
Details | Diff | Splinter Review |
5.95 KB,
patch
|
bkelly
:
review+
|
Details | Diff | Splinter Review |
This bug was filed from the Socorro interface and is report bp-f04a6930-7ca1-4b34-9cc1-1eedd2161122. ============================================================= This is an OOM large where |mozilla::dom::workers::WorkerPrivateParent<T>::DispatchControlRunnable| is pushing a runnable into it's queue and is OOMing due to an attempt to resize to 100's of MiB, indicating there 10's of millions of events queued. Almost all crashes are on Android. The few stacks I've looked at all appear to involve a timer event being pushed into the queue. I'm guessing somehow the worker has become unresponsive but a timer is still spinning and pushing events. There are a few things that can be done here: 1) Make the allocation fallible 2) Add memory reporting of the queue size 3) Figure out what timer is spamming the queue
Comment 1•8 years ago
|
||
Ben, do you have any ideas here? Would your recent throttling work help?
Flags: needinfo?(bkelly)
Assignee | ||
Comment 2•8 years ago
|
||
No, the throttling is for runnables coming from the worker thread to the main thread. This is a timer runnable being dispatched from the timer thread to the worker thread. Possible theories: 1. The WorkerPrivate is bogus when the timer fires and we're operating on corrupt memory. 2. The worker is frozen, but we keep adding control runnables for some reason. Are these all timer runnables? 3. We legitimately have all these runnables and the worker Queue class is just not that great. It should probably use nsEventQueue instead of nsTArray internally. 4. The worker thread is not getting a chance to run because its at low priority compared to other threads. Andrea, do you have any time to look at this?
Flags: needinfo?(bkelly) → needinfo?(amarchesini)
Reporter | ||
Comment 3•8 years ago
|
||
(In reply to Ben Kelly [:bkelly] from comment #2) > 1. The WorkerPrivate is bogus when the timer fires and we're operating on > corrupt memory. This seems unlikely as we're consistently increasing the array size (vs blowing up on some address of e5e5e5e5). > 2. The worker is frozen, but we keep adding control runnables for some > reason. Are these all timer runnables? The stacks I've checked are all dying when adding a timer runnable, so that that seems about right. > 3. We legitimately have all these runnables and the worker Queue class is > just not that great. It should probably use nsEventQueue instead of > nsTArray internally. The Queue class is definitely a bit over-engineered, a nsDeque could be subbed as well. > 4. The worker thread is not getting a chance to run because its at low > priority compared to other threads. That's definitely possible, but we're talking about millions of events. If we just say naively we're doubling the array size, an allocation of 160MiB fails, so we were going from 80MiB on a 32-bit system (these are mostly fennec), that's ~20 million events already queued. With an uptime of ~1 hour that's ~6,000 events per second so something seems horribly off. I guess one thing to do is audit whether or not the mutex is held when accessing the Queue (the class is misleading, in it's default instantiation it is not thread safe).
Assignee | ||
Comment 4•8 years ago
|
||
Well, we regularly fire these timer runnable to execute GC. It may just often be the final straw. We don't really know what the contents of the control runnable queue are, though.
Assignee | ||
Comment 5•7 years ago
|
||
So the dom/workers is using TimerThreadEventTarget() which has its own interesting Dispatch() implementation: https://dxr.mozilla.org/mozilla-central/source/dom/workers/WorkerPrivate.cpp#1569 The interesting thing here is that we run the nsTimerEvent runnable's Run() immediately. This allows the nsTimerImpl to reschedule period timers before we actually execute any work on the worker thread: https://dxr.mozilla.org/mozilla-central/source/xpcom/threads/nsTimerImpl.cpp#498 And we actually use TYPE_REPEATING_SLACK for garbage collection timers: https://dxr.mozilla.org/mozilla-central/source/dom/workers/WorkerPrivate.cpp#4812 This is bad because its conceivable garbage collections could take longer than our interval of 1 second. In these cases we will build up GC timer events and potentially never catch up. I think we should probably use ONE_SHOT timers for GC's and manually reschedule after we execute work on the actual worker thread.
Assignee: nobody → bkelly
Status: NEW → ASSIGNED
Flags: needinfo?(amarchesini)
Assignee | ||
Comment 6•7 years ago
|
||
Note, the 1 second GC falling behind is even more believable given this is largely a fennec crash.
Assignee | ||
Comment 7•7 years ago
|
||
This probably impacts our service worker stability on android devices.
Blocks: ServiceWorkers-stability
Assignee | ||
Comment 8•7 years ago
|
||
Actually, I think we should re-write TimerThreadEventTarget. That is probably not something we want to uplift, though. To mitigate the problem on release channels I'll add a separate patch where we use a pref set the worker GC timer intervals. If we increase these intervals I think it should reduce the number of OOMs.
Reporter | ||
Updated•7 years ago
|
Whiteboard: [MemShrink] → [MemShrink:P2]
Assignee | ||
Comment 9•7 years ago
|
||
First step adding the new event target.
Assignee | ||
Comment 10•7 years ago
|
||
Attachment #8826342 -
Attachment is obsolete: true
Assignee | ||
Comment 11•7 years ago
|
||
Assignee | ||
Comment 12•7 years ago
|
||
This makes worker GC use the new event target. I still need to fix the ServiceWorkerPrivate usage of TimerThreadEventTarget and then remove all the old code.
Assignee | ||
Comment 13•7 years ago
|
||
https://treeherder.mozilla.org/#/jobs?repo=try&revision=88631be90b1d28a7bfb51cf49bb7a5a176817980
Assignee | ||
Comment 14•7 years ago
|
||
I think I just need a cleanup patch now to remove the newly unused code.
Assignee | ||
Comment 15•7 years ago
|
||
Attachment #8828509 -
Attachment is obsolete: true
Assignee | ||
Comment 16•7 years ago
|
||
Attachment #8828510 -
Attachment is obsolete: true
Assignee | ||
Comment 17•7 years ago
|
||
Attachment #8828512 -
Attachment is obsolete: true
Assignee | ||
Comment 18•7 years ago
|
||
Attachment #8828978 -
Attachment is obsolete: true
Assignee | ||
Comment 19•7 years ago
|
||
https://treeherder.mozilla.org/#/jobs?repo=try&revision=5f761b505634de8c9d7db7e3a95e5dd47019e076
Assignee | ||
Comment 20•7 years ago
|
||
Hit a failure in last try because I did not call the default WorkerRunnable::Cancel() in my override. Lets try again. https://treeherder.mozilla.org/#/jobs?repo=try&revision=8fd031c0a8841a7b5cedf524639295566628c6ca
Attachment #8829635 -
Attachment is obsolete: true
Assignee | ||
Comment 21•7 years ago
|
||
Comment on attachment 8829655 [details] [diff] [review] P1 Add a WorkerControlEventTarget that executes runnables as WorkerControlRunnables. r=baku This adds new nsIEventTarget implementation that wraps a given nsIRunnable and dispatches it as a WorkerControlRunnable. This is different from TimerThreadEventTarget in an important way. The new event target does not run the source runnable until the worker thread can actually execute it. The TimerThreadEventTarget uses a "dummy" callback that runs immediately. This differences allows nsTimer's backpressure mechanism to work. If the worker thread is slow to respond then nsTimer will wait to schedule the next interval. The goal here is to avoid the OOM by not continuing to dispatch GC runnables to the worker thread while the worker thread is too busy to process any of them.
Attachment #8829655 -
Flags: review?(amarchesini)
Assignee | ||
Comment 22•7 years ago
|
||
Comment on attachment 8829636 [details] [diff] [review] P2 Make the WorkerPrivate create a WorkerControlEventTarget. r=baku Add the new WorkerControlEventTarget in a WorkerPrivate. We will use it later patches.
Attachment #8829636 -
Flags: review?(amarchesini)
Assignee | ||
Comment 23•7 years ago
|
||
Comment on attachment 8829637 [details] [diff] [review] P3 Make worker GC use new WorkerControlEventTarget. r=baku Modify worker GC to use the new event target. The stale worker gc bits will be removed in a later patch.
Attachment #8829637 -
Flags: review?(amarchesini)
Assignee | ||
Comment 24•7 years ago
|
||
Comment on attachment 8829638 [details] [diff] [review] P4 Make ServiceWorkerPrivate use new WorkerControlEventTarget instead of TimeThreadEventTarget. r=baku Make ServiceWorkerPrivate use the new WorkerControlEventTarget. In addition, I modified this to use WorkerHolder instead of busy count.
Attachment #8829638 -
Flags: review?(amarchesini)
Assignee | ||
Comment 25•7 years ago
|
||
Comment on attachment 8829640 [details] [diff] [review] P5 Removed unused TimerThreadEventTarget. r=baku Remove all the orphaned code. This deletes TimerThreadEventTarget and the various WorkerPrivate members associated with it.
Attachment #8829640 -
Flags: review?(amarchesini)
Assignee | ||
Comment 26•7 years ago
|
||
BTW, I see various other crashes that look like TimerThreadEventTarget. I'm hopeful these patches will fix those issues as well.
Comment 27•7 years ago
|
||
Comment on attachment 8829655 [details] [diff] [review] P1 Add a WorkerControlEventTarget that executes runnables as WorkerControlRunnables. r=baku Review of attachment 8829655 [details] [diff] [review]: ----------------------------------------------------------------- ::: dom/workers/WorkerPrivate.cpp @@ +1615,5 @@ > +{ > + mozilla::Mutex mMutex; > + WorkerPrivate* mWorkerPrivate; > + > + ~WorkerControlEventTarget(); = default; @@ +1637,5 @@ > +WorkerControlEventTarget::WorkerControlEventTarget(WorkerPrivate* aWorkerPrivate) > + : mMutex("WorkerControlEventTarget") > + , mWorkerPrivate(aWorkerPrivate) > +{ > + MOZ_DIAGNOSTIC_ASSERT(mWorkerPrivate); Why not only having all these methods inline? @@ +1663,5 @@ > +class WrappedControlRunnable final : public WorkerControlRunnable > +{ > + nsCOMPtr<nsIRunnable> mInner; > + > + ~WrappedControlRunnable() = default;
Attachment #8829655 -
Flags: review?(amarchesini) → review+
Updated•7 years ago
|
Attachment #8829636 -
Flags: review?(amarchesini) → review+
Updated•7 years ago
|
Attachment #8829637 -
Flags: review?(amarchesini) → review+
Updated•7 years ago
|
Attachment #8829638 -
Flags: review?(amarchesini) → review+
Updated•7 years ago
|
Attachment #8829640 -
Flags: review?(amarchesini) → review+
Assignee | ||
Comment 28•7 years ago
|
||
Attachment #8829655 -
Attachment is obsolete: true
Attachment #8830750 -
Flags: review+
Assignee | ||
Updated•7 years ago
|
Attachment #8830750 -
Attachment description: Add a WorkerControlEventTarget that executes runnables as WorkerControlRunnables. r=baku → P1 Add a WorkerControlEventTarget that executes runnables as WorkerControlRunnables. r=baku
Assignee | ||
Comment 29•7 years ago
|
||
Attachment #8829637 -
Attachment is obsolete: true
Attachment #8830752 -
Flags: review+
Assignee | ||
Comment 30•7 years ago
|
||
Attachment #8829640 -
Attachment is obsolete: true
Attachment #8830754 -
Flags: review+
Comment 31•7 years ago
|
||
Pushed by bkelly@mozilla.com: https://hg.mozilla.org/integration/mozilla-inbound/rev/6e1f8f0c4d94 P1 Add a WorkerControlEventTarget that executes runnables as WorkerControlRunnables. r=baku https://hg.mozilla.org/integration/mozilla-inbound/rev/e20b4c55bd6a P2 Make the WorkerPrivate create a WorkerControlEventTarget. r=baku https://hg.mozilla.org/integration/mozilla-inbound/rev/e881b33252de P3 Make worker GC use new WorkerControlEventTarget. r=baku https://hg.mozilla.org/integration/mozilla-inbound/rev/6196560e5967 P4 Make ServiceWorkerPrivate use new WorkerControlEventTarget instead of TimeThreadEventTarget. r=baku https://hg.mozilla.org/integration/mozilla-inbound/rev/d4f4afc096fd P5 Removed unused TimerThreadEventTarget. r=baku
Assignee | ||
Comment 32•7 years ago
|
||
[Tracking Requested - why for this release]: I want to let this bake on nightly for a few days, but we should consider uplifting it to fix the crashes. Its is more risky, though, than a nullptr fix.
status-firefox51:
--- → wontfix
status-firefox52:
--- → affected
status-firefox53:
--- → affected
status-firefox54:
--- → affected
status-firefox-esr45:
--- → wontfix
tracking-firefox52:
--- → ?
tracking-firefox53:
--- → ?
Comment 33•7 years ago
|
||
bugherder |
https://hg.mozilla.org/mozilla-central/rev/6e1f8f0c4d94 https://hg.mozilla.org/mozilla-central/rev/e20b4c55bd6a https://hg.mozilla.org/mozilla-central/rev/e881b33252de https://hg.mozilla.org/mozilla-central/rev/6196560e5967 https://hg.mozilla.org/mozilla-central/rev/d4f4afc096fd
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla54
Comment 34•7 years ago
|
||
Might be worth to document somewhere in Worker code that WorkerControlEventTarget behaves as it behaves with timers because of nsTimerEvent's implementation
Assignee | ||
Comment 35•7 years ago
|
||
(In reply to Olli Pettay [:smaug] from comment #34) > Might be worth to document somewhere in Worker code that > WorkerControlEventTarget behaves as it behaves with timers because of > nsTimerEvent's implementation Well, it doesn't really. It behaves per the nsIEventTarget contract now. When you dispatch a runnable it doesn't call Run() until it actually runs on the target. It was the old code I removed that was doing something non-standard and hidden. This broke one of the features of nsTimer because it expected standard nsIEventTarget behavior.
Assignee | ||
Comment 37•7 years ago
|
||
I've noticed a few worker related test timeouts this morning. I don't know if they are related to this, bug 1216175, or something else, but we should watch them. Bug 1334378 Bug 1334379 Bug 1334383
Comment 38•7 years ago
|
||
Ben, any conclusions from the test timeouts or results on crash-stats? Would you like to try at least an aurora uplift?
Flags: needinfo?(bkelly)
Assignee | ||
Comment 39•7 years ago
|
||
Of those 3 bugs only bug 1334383 has triggered more than. It does seem it could be caused by this landing, though. I don't think we should uplift until we can look at that.
Depends on: 1334383
Flags: needinfo?(bkelly)
Assignee | ||
Comment 40•7 years ago
|
||
[Tracking Requested - why for this release]: Do we still want to track this for FF52 and FF53? It appears the fix in this bug created some low frequency assertion crashes in bug 1334383. I think we can fix these for FF54, but they might be risky to uplift. +froydnj since this decision depends on our ability to uplift xpcom timer thread changes.
Comment 41•7 years ago
|
||
Ben, if it turns out this is still an issue can you request uplift? Do you need help from QA?
Assignee | ||
Comment 42•7 years ago
|
||
I don't think its safe to uplift this patch. We need to fix some fallout from the change in trunk and I doubt those changes could be easily uplifted either.
Flags: needinfo?(bkelly)
Updated•7 years ago
|
You need to log in
before you can comment on or make changes to this bug.
Description
•