Open Bug 1413172 Opened 7 years ago Updated 2 years ago

Analyze counts of shutdownkill crash pings

Categories

(Toolkit :: Telemetry, enhancement, P3)

enhancement

Tracking

()

People

(Reporter: chutten, Unassigned)

References

Details

With bug 1410143 we can now determine which crash pings are caused by shutdown kills. Use this information to cross-check the counts reported by crash_aggregates' content_shutdown_crashes and see what we can learn to support the investigation in bug 1393107.
I have looked at content crashes according to:

crash_summary (ie, crash pings): https://sql.telemetry.mozilla.org/queries/49088/source#table
crash_aggregates (ie, main pings): https://sql.telemetry.mozilla.org/queries/49091/source#table

(These are both filtering to include only submissions from nightly builds new enough to have the new ipc_channel_error stuff from bug 1410143)

They... do not quite line up. For instance, on Nov 21, crash_aggregates reports 12,579 total content crashes of which 10,126 of them are shutdown kills (for a total of 2453 non-kill content crashes). On the same day, crash_summary reported 3798 "shutdown web" crashes (web content crashes that are kills) and 2757 "not shutdown web" crashes (web content crashes that aren't kills).

2453 vs 2757 is pretty close. But 10126 vs 3798 is _way off_.

So, a couple of notes.

1) These are both using submission_date, not activity_date. This is not as analogous as it might seem as the submission_date for crash_aggregates is based on "main" ping submission (which may take a little while for content crashes), whereas for crash_summary it is "crash" ping submission (basically instantaneous). To account for this I guess I should rerun this with activity_date, which ought to bring the 2453 vs 2757 numbers closer together, if there is any good in this world.

2) We _know_ that content_shutdown_crashes is overcounting (see bug 1393107). Is it overcounting by 3x, though?

3) Even counting _all_ kinds of content-process shutdown crashes, that only brings 3798 up by another 425 crashes. If we include "null"-type crashes (which don't have a remoteType), that's only another 1417. Sum it all up and we're still only at 5640 shutdown kills as reported by crash pings.

Next steps:

1) Repeat the analysis using subsession_start_date as the activity_date for crash_summary, and activity_date directly from crash_aggregates to see if the numbers shift more into or out of alignment. (keeping in mind that crash_summary and crash_aggregates still won't have the same view of the world, especially for recent events given the latency disparity... but for anything more than Two Days old it should be fine)

2) ni?gsvelto for his opinion on what these numbers may mean, any cases he can think of where crash pings may misbehave, or any faults in my queries (like how I'm just pretending that NULL is an acceptable value in my results table)
Flags: needinfo?(gsvelto)
Quick wild guess: we send crash pings only for content (as in web, extension or file) and GPU crashes. Is it possible that crash_aggregates is counting plugin and gmpplugin crashes as content crashes? If that's the case then it could explain the discrepancy.
Flags: needinfo?(gsvelto)
plugin and gmplugin crashes are counted separately from content crashes, and likely separately from content shutdown crashes, too.

crash_aggregates uses the keyed histogram SUBPROCESS_CRASHES_WITH_DUMP to determine how many crashes are from how many processes[1], and uses the key "content" for content crashes. This should have one-to-one correspondence with RemoteType=web crashes (maybe plus RemoteType=extension,webLargeAllocation,file), the code[2] seems to lump all content processes into the same bucket, while keeping plugin/gmplugin/gpu separate.

But as for shutdown kills... The code for that lives in ContentParent[3] so it stands to reason that it counts RemoteType=web,extension,webLargeAllocation,file kinds of processes. I don't think plugin/gmplugin apply.

[1]: https://github.com/mozilla/telemetry-batch-view/blob/master/src/main/scala/com/mozilla/telemetry/views/CrashAggregateView.scala#L329
[2]: https://searchfox.org/mozilla-central/rev/a5d613086ab4d0578510aabe8653e58dc8d7e3e2/ipc/glue/CrashReporterHost.cpp#252
[3]: https://searchfox.org/mozilla-central/source/dom/ipc/ContentParent.cpp#3044
(In reply to Chris H-C :chutten from comment #3)
> crash_aggregates uses the keyed histogram SUBPROCESS_CRASHES_WITH_DUMP to
> determine how many crashes are from how many processes[1], and uses the key
> "content" for content crashes. This should have one-to-one correspondence
> with RemoteType=web crashes (maybe plus
> RemoteType=extension,webLargeAllocation,file), the code[2] seems to lump all
> content processes into the same bucket, while keeping plugin/gmplugin/gpu
> separate.

Interesting, the crash ping is sent by t he crash service just the line above the Telemetry::Accumulate() invocation so they should match... unless the ping is not sent but that would mean that telemetry is unable to send a very large number of pings which is unlikely.

> But as for shutdown kills... The code for that lives in ContentParent[3] so
> it stands to reason that it counts
> RemoteType=web,extension,webLargeAllocation,file kinds of processes. I don't
> think plugin/gmplugin apply.

Yeah, they shouldn't.
(I erroneously mentioned that crash pings would have a subsession_start_date in comment 1. They do not. Instead they have crashdate, which is the day the crash occurred)

crash_aggregates, when looking at activity_date has: 12,751 content crashes, of which 10,111 are shutdown. (2640 not-shutdown)

crash_summary, when looking at crashdate has: 2749 not-shutdown web content crashes, 3721 shutdown web content crashes (plus 409 extension shutdown, 2 webLargeAllocation shutdown, and 43 file shutdown)

Looking at activity-date-based figures brings the numbers no closer together than submission-date-based figures. This may be unavoidable due to the differences between the crashdate annotation in the crash ping and the subsession_start_date field in the main ping.

Next steps:
Determine if maybe there's a problem with shutdownkill crash pings not being sent in appropriate volume. With a 3x discrepancy, I wonder if maybe we aren't sending multiple shutdownkills for the e10s-multi case when multiple content processes are taking too long to shut down...
:ted, would you happen to know if there's some sort of coalescing behaviour in crash management when multiple content processes are ShutDownKill'd simultaneously?
Flags: needinfo?(ted)
Priority: P2 → P3
I do not, sorry! (I have always been a bit fuzzy on the e10s crashreporter interaction bits.)
Flags: needinfo?(ted)
Probably due to this bug, it seems like the content_shutdown_crashes measure in error_aggregates is frequently calculated as being higher than content_crashes:

https://sql.telemetry.mozilla.org/queries/52261/source

Discussion on IRC:

https://mozilla.logbot.info/missioncontrol/20180327#c14526479
It seems as though my theory of "we send one crash ping per process that crashes but count in main pings only one crash per browser instance" isn't consistent with the facts.

1) set dom.ipc.tabs.shutdownTimeoutSecs to 1
2) load firefox.com, mozilla.org, eff.org in tabs
3) close Firefox
4) Check the "main" ping with reason "shutdown" for payload.keyedHistograms.SUBPROCESS*

I got SUBPROCESS_ABNORMAL_ABORT/content with a sum of 3 and SUBPROCESS_KILL_HARD/ShutDownKill with a sum of 3. (I would have had SUBPROCESS_CRASHES_WITH_DUMP/content with 3 instead of ABORT, but I can't figure out how to properly enable the crash reporter on my Firefox build)

This is inconsistent with my hypothesis.
Possible Next step: Longitudinal study of crashing clients' information flow.

For a client with a shutdownkill, a client with a non-shutdownkill content crash:
At what point do we receive their crash pings? How many?
At what point do we receive their main ping that contains SHUTDOWN_KILL_HARD and SUBPROCESS_CRASHES_WITH_DUMP? Are they both on the same ping? How many do they count?

It could be that we're seeing the same information in all of these counts, but at different times.
It could be that we're seeing different information in some of these counts.

I'm busy with other stuff this week, so I'm unassigning this.
Assignee: chutten → nobody
Status: ASSIGNED → NEW
https://github.com/mozilla/missioncontrol/issues/203 mentions that this is needed for the overall crash rate for the MC dashboard. chutten - Is this something you would be able to work on after you finish your other work?
Flags: needinfo?(chutten)
I hope to, but I don't know when this will reach the top of the pile :S
Flags: needinfo?(chutten)
During discussion today with Will we agreed this is needed before we deprecate AWSY.
(In reply to Marcia Knous [:marcia - needinfo? me] from comment #13)
> During discussion today with Will we agreed this is needed before we
> deprecate AWSY.

Another alternative would be to fix bug 1453485. This would make the miscounted shutdown crashes less important. We'd still want to fix this, but I think that would be enough to migrate away from AWSY.
Severity: normal → S3
You need to log in before you can comment on or make changes to this bug.