Closed Bug 1236108 Opened 8 years ago Closed 8 years ago

Crash report memory annotations not implemented for child processes (content processes)

Categories

(Toolkit :: Crash Reporting, defect, P1)

defect

Tracking

()

RESOLVED FIXED
mozilla48
Tracking Status
e10s m9+ ---
firefox45 --- wontfix
firefox46 + fixed
firefox47 --- fixed
firefox48 --- fixed

People

(Reporter: mccr8, Assigned: bugzilla)

References

(Blocks 1 open bug)

Details

(Whiteboard: [MemShrink:P1])

Attachments

(3 files, 1 obsolete file)

While looking at the crash reports in bug 1235633, I noticed that crash signature is "OOM | unknown" rather than "OOM | large" or "OOM | small", which is odd, because I've seen a crash at the same allocation site before and it was annotated properly. Then I looked at the individual crash signatures and none of the ones I looked at have the nice memory information (Total Virtual Memory, Available Virtual Memory, System Memory Use Percentage, etc.) that we usually get in a crash report. I'm guessing the "unknown" results from the allocation size field being missing, too.

If you click on the "experiment" link in bug 1234647 comment 0, every OOM crash signature is "unknown", so it looks like this isn't working at all.

The odd thing is, if I look at the top 300 crashes from Nightly or Aurora, there are no "OOM | unknown", so I'm not sure if there's a problem specific to beta with these memory annotations or what.

These memory annotations are important to understanding our OOM crashes.
Benjamin, do you know who might be able to look into this now that dmajor has gone? Thanks.
tracking-e10s: --- → ?
Flags: needinfo?(benjamin)
Whiteboard: [MemShrink]
It turns out the experiment link isn't aggregating crashes together, so it isn't an apples-to-apples comparison, but still, in the first three pages of crashes I found a number of "OOM | unknown" crashes, but no other "OOM" crashes.
The "unknown" ones seem to have a "Process Type" field "content" and the other ones don't, so maybe we aren't properly recording the memory information for content process OOM crashes? Depending on how exactly that information is being generated and recorded, I could imagine there being some kind of sandboxing issues.
This is about content processes. We record the memory information here for main-process crashes: http://hg.mozilla.org/mozilla-central/annotate/0771c5eab32f/toolkit/crashreporter/nsExceptionHandler.cpp#l833

But since content process crashes don't go through this code path (the parent writes the annotations on behalf of the child process, rather than the child writing them directly), these aren't recorded.
Flags: needinfo?(benjamin)
Summary: Crash report memory annotations not working in e10s beta experiment → Crash report memory annotations not implemented for child processes (content processes)
Georg, is this something you can take care of?
Blocks: e10s-perf
Flags: needinfo?(gfritzsche)
Priority: -- → P4
I'll take it.
Assignee: nobody → aklotz
Status: NEW → ASSIGNED
Flags: needinfo?(gfritzsche)
(In reply to Aaron Klotz [:aklotz] (please use needinfo) from comment #6)
> I'll take it.

Thanks
Whiteboard: [MemShrink] → [MemShrink:P1]
This is substantial to be able to do useful crash comparisons and to know which OOM crashes are worth looking at, as "small" (<256k) OOMs are pretty much unactionable (just a sign that we generally are running out of memory), while large ones probably need to be worked on.
Sounds like we need to revisit the P4 classification here.
Flags: needinfo?(jgriffiths)
Based on discussion today, I think this is a P1. It's already assigned and when landed should get uplifted to 46 to catch our Beta cycle there.
Flags: needinfo?(jgriffiths)
Priority: P4 → P1
Blocks: 1250637
This code is super-close to being reviewable, but I need to sort out problems with the regression test that I wrote for it. Stand by...
The previous patch in this series creates a new directory service entry
specifically for obtaining the content process temp directory.

This patch converts everything else to reference that entry. It also sets
appropriate environment variables in the content processes so that system
APIs automatically pick up the directory. This is necessary for the crash
reporter to be able to call those APIs in exception handling contexts.

Review commit: https://reviewboard.mozilla.org/r/37625/diff/#index_header
See other reviews: https://reviewboard.mozilla.org/r/37625/
Attachment #8725778 - Flags: review?(haftandilian)
Attachment #8725778 - Flags: review?(bobowen.code)
This patch redefines XP_PATH_MAX on Windows to be MAX_PATH + 1. I did this
because the longer definition would actually not work with most Windows APIs.
Some APIs can work with longer lengths if the path is prefixed with "\\?\", but
that is not guaranteed in general.

See https://msdn.microsoft.com/en-us/library/windows/desktop/aa365247(v=vs.85).aspx#maxpath

Review commit: https://reviewboard.mozilla.org/r/37629/diff/#index_header
See other reviews: https://reviewboard.mozilla.org/r/37629/
Attachment #8725780 - Flags: review?(benjamin)
Comment on attachment 8725777 [details]
MozReview Request: Bug 1236108: Add temp directory for sandboxed content processes to directory

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/37623/diff/1-2/
Attachment #8725777 - Attachment description: MozReview Request: Bug 1236108: Add temp directory for low-integrity content processes to directory service; r?bsmedberg → MozReview Request: Bug 1236108: Add temp directory for sandboxed content processes to directory service; r?bsmedberg
Comment on attachment 8725778 [details]
MozReview Request: Bug 1236108: Modify sandbox initialization code to use directory service to obtain content process temp directory; r?bobowen, haik

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/37625/diff/1-2/
Comment on attachment 8725779 [details]
MozReview Request: Bug 1236108: Add Windows sandbox policy for exception-context crash report annotations; r?bobowen

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/37627/diff/1-2/
Comment on attachment 8725780 [details]
MozReview Request: Bug 1236108: Add support for exception-context annotations for content processes to the crash reporter; r?bsmedberg

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/37629/diff/1-2/
Comment on attachment 8725778 [details]
MozReview Request: Bug 1236108: Modify sandbox initialization code to use directory service to obtain content process temp directory; r?bobowen, haik

https://reviewboard.mozilla.org/r/37625/#review34377

When I originally created the low integrity temp, I was setting TEMP and TMP.
I had to get rid of this as it was causing locking problems when I was trying to delete the temp directory in the content process.
At that point it was creating a unique per process dir, so this meant they got left lying around (I added temporary code to clean these up).

This shouldn't be an issue now that the deletion happens in the parent process as any locks should have gone.

However, I still decided not to add it back in straight away because I was unsure whether it was the best thing to do.
If we override these then other things pick it up, they might not be expecting the directory to get deleted all the time.

See bug 1166637 for example.
I suggested using sandboxing rules for this in the description, but I'm less sure that's a good idea.

Maybe we could create a separate AppData\LocalLow\Mozilla\Temp and override TEMP and TMP to that.
Of course that wouldn't get cleared and people would not know that it was something they might clear manually.

So, perhaps the always deleted solution is best, even if it would only give benefits like caching for that process.

::: dom/ipc/ContentProcess.cpp:52
(Diff revision 2)
> +  NS_WARN_IF(!SetEnvironmentVariableW(L"TMP", fullTmpPath.get()));

Should we be setting TEMP here as well?
Attachment #8725778 - Flags: review?(bobowen.code)
Comment on attachment 8725779 [details]
MozReview Request: Bug 1236108: Add Windows sandbox policy for exception-context crash report annotations; r?bobowen

https://reviewboard.mozilla.org/r/37627/#review34401

::: security/sandbox/win/src/sandboxbroker/sandboxBroker.cpp:208
(Diff revision 2)
> +      mPolicy->AddRule(sandbox::TargetPolicy::SUBSYS_FILES,

I don't understand why we would need this if we are using a low integrity directory.

Either way, GetTempPath will be different and this would be the pid of the parent not child.
Attachment #8725779 - Flags: review?(bobowen.code)
I meant to say thanks for cleaning up the getting of the directory in the first place.
I originally hadn't put things into the directory service, because I didn't think I could use prefs there, but I hadn't thought of using the app one.
Attachment #8725778 - Flags: review?(haftandilian) → review+
Comment on attachment 8725778 [details]
MozReview Request: Bug 1236108: Modify sandbox initialization code to use directory service to obtain content process temp directory; r?bobowen, haik

https://reviewboard.mozilla.org/r/37625/#review34421

On the Mac side, looks good to me, just as long as it won't be a problem that the temp directory is removed/recreated during startup. i.e., assuming that wouldn't result in lost crash data.

I referenced another temp dir location for OS X on comment 23 of 1237847 in case that's useful: https://bugzilla.mozilla.org/show_bug.cgi?id=1237847#c23
Comment on attachment 8725780 [details]
MozReview Request: Bug 1236108: Add support for exception-context annotations for content processes to the crash reporter; r?bsmedberg

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/37629/diff/2-3/
Attachment #8725779 - Attachment is obsolete: true
https://reviewboard.mozilla.org/r/37625/#review34421

Yeah, this is only used during a very short window of time between the exception being caught in content and the crash being picked up in chrome. It doesn't need to persist any longer than that.
https://reviewboard.mozilla.org/r/37625/#review34377

It sounds like we should QA this at least?

> Should we be setting TEMP here as well?

That shouldn't be necessary: GetTempPath searches the environment variables in a specific order, and uses the first one that it finds. TMP is the first one that it checks.
https://reviewboard.mozilla.org/r/37625/#review34377

> That shouldn't be necessary: GetTempPath searches the environment variables in a specific order, and uses the first one that it finds. TMP is the first one that it checks.

Unless we're worrying about third-party garbage?
https://reviewboard.mozilla.org/r/37625/#review34421

Thanks for explaining. Should we be concerned about a crash in content that somehow results in a crash in the parent before it processes the content crash? If we had to pick one, I suppose the parent crash would be more important to resolve first.
https://reviewboard.mozilla.org/r/37625/#review34377

> Unless we're worrying about third-party garbage?

Yes, I was thinking of other things using the TMP and TEMP variable and not GetTempPath.
Hopefully they wouldn't do that, but having TMP and TEMP different might cause issues if that other code is internally inconsistent, so why not set both.
Comment on attachment 8725777 [details]
MozReview Request: Bug 1236108: Add temp directory for sandboxed content processes to directory

https://reviewboard.mozilla.org/r/37623/#review34525

Please add a more descriptive commit message: explain what code you're moving around and what this dir is "for".

::: toolkit/xre/nsXREDirProvider.cpp:634
(Diff revision 2)
> +GetContentProcessTempBaseDir()

This should be Get...DirKey
Attachment #8725777 - Flags: review?(benjamin) → review+
Attachment #8725780 - Flags: review?(benjamin) → review+
Comment on attachment 8725780 [details]
MozReview Request: Bug 1236108: Add support for exception-context annotations for content processes to the crash reporter; r?bsmedberg

https://reviewboard.mozilla.org/r/37629/#review34527

r+ with corrections as posted

::: toolkit/crashreporter/nsExceptionHandler.cpp:166
(Diff revision 3)
> +static const XP_CHAR childCrashAnnotationFileName[] = XP_TEXT("GeckoChildCrash");

s/FileName/BaseName/

::: toolkit/crashreporter/nsExceptionHandler.cpp:262
(Diff revision 3)
> +static nsIFile* contentProcessTmpDir = nullptr;

1. This should be documented as being valid only in chrome process (not content)
2. This shouldn't be a nsIFile* because of shutdown issues and leak checking. xpstring (or maybe XP_CHAR) is the better choice.

::: toolkit/crashreporter/nsExceptionHandler.cpp:1186
(Diff revision 3)
> +  static XP_CHAR tempPath[XP_PATH_MAX] = {0};

Can you assert that this is a content-type process here? It helps with readability.

::: toolkit/crashreporter/nsExceptionHandler.cpp:2848
(Diff revision 3)
> +{

Assert: this runs in chrome process.

::: toolkit/crashreporter/nsExceptionHandler.cpp:2952
(Diff revision 3)
> +    AnnotationTable exceptionTimeAnnotations;

Make this a separate helper function.

::: toolkit/crashreporter/nsExceptionHandler.cpp:2986
(Diff revision 3)
> +  if (exceptionTimeExtra) {

This is a side effect which is important to document in nsExceptionHandler.h
Comment on attachment 8725777 [details]
MozReview Request: Bug 1236108: Add temp directory for sandboxed content processes to directory

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/37623/diff/2-3/
Attachment #8725777 - Attachment description: MozReview Request: Bug 1236108: Add temp directory for sandboxed content processes to directory service; r?bsmedberg → MozReview Request: Bug 1236108: Add temp directory for sandboxed content processes to directory
Comment on attachment 8725778 [details]
MozReview Request: Bug 1236108: Modify sandbox initialization code to use directory service to obtain content process temp directory; r?bobowen, haik

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/37625/diff/2-3/
Attachment #8725778 - Flags: review?(bobowen.code)
Comment on attachment 8725780 [details]
MozReview Request: Bug 1236108: Add support for exception-context annotations for content processes to the crash reporter; r?bsmedberg

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/37629/diff/3-4/
Comment on attachment 8725778 [details]
MozReview Request: Bug 1236108: Modify sandbox initialization code to use directory service to obtain content process temp directory; r?bobowen, haik

https://reviewboard.mozilla.org/r/37625/#review35173

::: dom/ipc/ContentProcess.cpp:44
(Diff revision 3)
> -  // On Windows, the sandbox-writable temp directory resides in the
> -  // low integrity sandbox base directory.
> -  return NS_WIN_LOW_INTEGRITY_TEMP_BASE;
> +  // Save the TMP environment variable so that is is picked up by GetTempPath().
> +  // Note that we specifically write to the TMP variable, as that is the first
> +  // variable that is checked by GetTempPath() to determine its output.

nit: comment needs updating slightly maybe.
Attachment #8725778 - Flags: review?(bobowen.code) → review+
Comment on attachment 8725778 [details]
MozReview Request: Bug 1236108: Modify sandbox initialization code to use directory service to obtain content process temp directory; r?bobowen, haik

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/37625/diff/3-4/
Comment on attachment 8725780 [details]
MozReview Request: Bug 1236108: Add support for exception-context annotations for content processes to the crash reporter; r?bsmedberg

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/37629/diff/4-5/
https://hg.mozilla.org/mozilla-central/rev/a6ad577af025
https://hg.mozilla.org/mozilla-central/rev/19074ba43502
https://hg.mozilla.org/mozilla-central/rev/4b734a8db65c
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla48
Depends on: 1254963
Only appears to partially have worked.
OOM Allocation Size in a couple but no other memory information.
bp-ca43d6b2-f28c-4f14-a477-ee3f32160310
bp-aa75ad1b-8595-48ad-9d81-0f0c82160310 (Thread 1)
Not even OOM Allocation Size (possibly large)
bp-db979201-a2a7-4f81-8997-434532160310 (Thread 1)
bp-00b8d782-b989-40f6-bab7-f72da2160310
Don't think this has allocation size but should have other memory information.
bp-f3dc2cb6-68d0-4916-8126-21de72160310
Just spotted after posting first two are correct, both linux; does not list memory information.
(In reply to Jonathan Howard from comment #50)
> Just spotted after posting first two are correct, both linux; does not list
> memory information.

FWIW, for the purposes of this bug, the OOMAllocationSize annotation is the only additional information that should be expected.

I'm not sure what is going on with those Windows builds, but it depends on a few factors. If we're not calculating the OOM allocation size correctly and it is being set to 0, there won't be an annotation. OTOH there could be some other kind of problem.

I'm going to wait another day and see what else comes up. If this is a persistent issue with the next day's crash reports then I'll try to investigate further.
Aaron, should we uplift this crash reporter fix to Beta 46? We're running an e10s experiment on Beta 46, so we want to be testing with all the e10s fixes for known issues.
Flags: needinfo?(aklotz)
From my side, I'm already happy if we end up with signatures that are either "OOM | ..." or "js::AutoEnterOOMUnsafeRegion::crash | ..." (though I personally would like the latter to also start with "OOM |" but that's out of scope for this bug and not e10s-dependent) - those kinds of signatures we can detect as being OOM and therefore we can start usefully comparing e10s with non-e10s crashes.
Oh, and yes, I'd love to see an uplift to 46.
(In reply to Chris Peterson [:cpeterson] from comment #52)
> Aaron, should we uplift this crash reporter fix to Beta 46? We're running an
> e10s experiment on Beta 46, so we want to be testing with all the e10s fixes
> for known issues.

I plan to uplift, yes. I'd like this to bake for one more day so that I can look at some more incoming crash reports and ensure that everything looks okay.
Flags: needinfo?(aklotz)
Sounds good.

[Tracking Requested - why for this release]:

We'd like to track this crash reporter fix for 46 because we need it to accurately monitor stability of our e10s beta experiment.
Comment on attachment 8725780 [details]
MozReview Request: Bug 1236108: Add support for exception-context annotations for content processes to the crash reporter; r?bsmedberg

(This request is for all three patches in this bug)

Approval Request Comment
[Feature/regressing bug #]: OOM crash report annotations for content processes
[User impact if declined]: More difficult for us to determine OOM crash rates with e10s
[Describe test coverage new/current, TreeHerder]: These patches add a new test which has landed and is passing on Nightly. The expected OOM annotations are now available on crash-stats for Nightly builds.
[Risks and why]: These are not trivial patches, but they are not super complex either. I'd say that the risk is fairly low, and the benefits from being able to properly classify content OOMs is worth it.
[String/UUID change made/needed]: None
Attachment #8725780 - Flags: approval-mozilla-beta?
Attachment #8725780 - Flags: approval-mozilla-aurora?
Note to Sheriffs: These patches don't push cleanly on beta. I have a rebase that I can push myself once approved.
Marking affected/tracking for 46 since we are running a beta 46 e10s experiment.
Comment on attachment 8725780 [details]
MozReview Request: Bug 1236108: Add support for exception-context annotations for content processes to the crash reporter; r?bsmedberg

Better OOM info in crash-stats sounds great to me.
Aaron, please go ahead and push your beta fix, I would love this in beta 2 which goes to build on Monday.
Attachment #8725780 - Flags: approval-mozilla-beta?
Attachment #8725780 - Flags: approval-mozilla-beta+
Attachment #8725780 - Flags: approval-mozilla-aurora?
Attachment #8725780 - Flags: approval-mozilla-aurora+
Backed out from Aurora for test_content_exception_time_annotation.js failures. Looks like you'll need to handle all the uplifts, Aaron.
https://hg.mozilla.org/releases/mozilla-aurora/rev/34a219cbd67d

https://treeherder.mozilla.org/logviewer.html#?job_id=2125699&repo=mozilla-aurora
test_content_exception_time_annotation.js | run_test/< - [run_test/< : 15] false == true
Flags: needinfo?(aklotz)
Depends on: 1256541
Bug 1256541 is blocking uplift. Leaving myself on ni? To land this once that has been taken care of.
Blocks: 1257486
Blocks: 1259195
Bug 1256541 landed in m-c, so we should be able to uplift both of these soon (for beta 6 or 7)
hey Liz, do you know which beta build this made it into?
Flags: needinfo?(lhenry)
Beta 7, released April 4th. We released Beta 8 on April 5th so there may not be very much data for beta 7.
Flags: needinfo?(lhenry)
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: