Open Bug 1274628 Opened 8 years ago Updated 1 year ago

Annotate crashes when the code in memory around the crashing instruction differs from the code in the shipped binary

Categories

(Socorro :: General, task, P3)

Tracking

(Not tracked)

People

(Reporter: dbaron, Unassigned)

References

(Blocks 1 open bug)

Details

In bug 1272750 we have a minidump that crashes because the code being executed is corrupt relative to what is in the binary we ship.  We know this, I believe, because the minidump contains some of the memory around the instruction pointer.

Crashes where the code around the instruction pointer doesn't match the code in the binary are clear signs of memory corruption, and they may be a decent portion of our crashes.  It would be useful if they could be prominently marked in Socorro so that it's clear they're not worth investigating (perhaps with details showing the difference so that we can spot-check that the analysis is correct), and also so that we could get statistics on how common this problem is.

(This is perhaps an alternative to bug 1274428.  This is a much more focused heuristic for detecting the particular problem we saw there.)
Also note that this would only catch a subset of memory/disk corruption of the binary, since some corruption could lead to things like jumping elsewhere before crashing.
We'd have to exclude JIT-generated code.
See also bug 1251395, which would take this a bit further and actually add suspicious pages to the crash dump.
See Also: → 1251395
I don't see how that bug is related; this is about doing more with information we already have in minidumps.
Depends on: 1280442
Some extra info that was from bug 1280442:

> It may be that this is mismatched because of:
> 
> * function hooking (from 3rd-party)
> * bad memory
> * bad disk
> 
> In reporting, if there are mismatches we should highlight the particular disassembly situation (both original
> and modified) and show whether this is a one-bit flip or something else.

And relocations will need to be accounted for, which is finicky but doable.
Extracting the memory around the crashing instruction from the minidump sounds straightforward.

Extracting the corresponding bytes from the shipped binary for matching sounds harder. We'd need to access the binary from somewhere (in bug 1280442 ted mentioned the Windows symbol server), and we'd need to do this on every(?) Windows crash. All that sounds slow...
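
For the extraction half, a rough sketch of what that could look like using the rust-minidump `minidump` crate (that crate is an assumption on my part, and the struct/field names below are from memory, so they may not match the current API exactly):

```rust
// Sketch: pull the bytes captured around the crashing address out of a minidump.
// Assumes the `minidump` crate; error handling and bounds checks kept minimal.
use minidump::{Minidump, MinidumpException, MinidumpMemoryList};

fn bytes_around_crash(path: &str, window: u64) -> Result<Vec<u8>, minidump::Error> {
    let dump = Minidump::read_path(path)?;
    let exception: MinidumpException = dump.get_stream()?;
    let crash_addr = exception.raw.exception_record.exception_address;

    let memory: MinidumpMemoryList = dump.get_stream()?;
    let region = memory
        .memory_at_address(crash_addr)
        .expect("no memory captured around the crash address");

    // Clamp the requested window to what the dump actually captured.
    let start = crash_addr.saturating_sub(window).max(region.base_address);
    let end = (crash_addr + window).min(region.base_address + region.size);
    let lo = (start - region.base_address) as usize;
    let hi = (end - region.base_address) as usize;
    Ok(region.bytes[lo..hi].to_vec())
}
```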
Summary: annotate crashes where the code around the instruction pointer is not the code that's in the binary we ship → Annotate crashes when the code in memory around the crashing instruction differs from the code in the shipped binary
> Extracting the corresponding bytes from the shipped binary for matching
> sounds harder. We'd need to access the binary from somewhere (in bug 1280442
> ted mentioned the Windows symbol server), and we'd need to do this on
> every(?) Windows crash. All that sounds slow...

I talked with lonnen about this. He said the interface between Socorro and releng infrastructure isn't great at the moment, and so getting ahold of the right binaries, while possible, would require significant effort, and it's probably not something the Socorro team is likely to be able to help with in the near future :(
A couple of other notes, for future reference:

- Bug 1124241 and bug 1268029 were about integrating jit-crash-categorize, which is a similar sort of thing and indicates how this step might fit into Socorro's processing.

- froydnj thinks that adjusting for relocations shouldn't be necessary, because text (code) doesn't have relocations on Windows.
I don't think that finding the matching binaries would be hard at all. We don't need any releng infrastructure: the binaries are already stored as part of the symbol server and can be looked up by debug ID directly from the minidump without any other metadata.
(In reply to Benjamin Smedberg [:bsmedberg] from comment #10)
> I don't think that finding the matching binaries would be hard at all. We
> don't need any releng infrastructure: the binaries are already stored as
> part of the symbol server and can be looked up by debug ID directly from the
> minidump without any other metadata.

Ok, good to know. This matching is something that we'd want to do on almost every crash, so we'd want to cache copies of the binaries on Socorro somehow.
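For reference, the lookup key for a binary on a Microsoft-style symbol server is just the file name plus a "code identifier" built from the module's PE TimeDateStamp and SizeOfImage, both of which are in the minidump's module list. A minimal sketch of building the URL; the base URL and the exact case/zero-padding conventions are assumptions that would need checking against our server:

```rust
/// Build a Microsoft-style symbol server URL for a PE binary. The lookup key
/// is the file name plus a code id: TimeDateStamp (8 hex digits) followed by
/// SizeOfImage (hex). Both values come straight from the minidump module list.
fn binary_url(base: &str, code_file: &str, time_date_stamp: u32, size_of_image: u32) -> String {
    // e.g. https://symbols.mozilla.org/xul.dll/5B8E1F2D5a3b000/xul.dll
    format!(
        "{}/{}/{:08X}{:x}/{}",
        base, code_file, time_date_stamp, size_of_image, code_file
    )
}
```

Note that the stored file may be the cab-compressed variant (presumably with the usual symstore naming, last character replaced by an underscore, e.g. xul.dl_), so a decompression step would still be needed.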
Just to fill in some info here:
The "symbol server" just serves files from an s3 bucket where we upload symbols (both .sym format and the native format that the toolchain produces). For Windows builds we also upload the binaries, since Microsoft's debuggers support fetching symbols+binaries given just a minidump (very handy for debugging minidumps from random builds from crash-stats).

The stackwalk binary that Socorro uses for processing has code to fetch symbols via HTTP and cache them on local disk, so we already fetch the .sym files for processing that way.

I have a script that fetches Microsoft system library symbols from their symbol server, and as part of that for Win64 libraries it will also fetch the DLL from their symbol server, since on Win64 that's where the unwind tables live:
https://hg.mozilla.org/users/tmielczarek_mozilla.com/fetch-win32-symbols/file/52d03ad15bdf/symsrv-fetch.py#l121

We'd need something like that to fetch the binaries during processing. That code shells out to `cabextract` to decompress the files (they're stored compressed on the symbol server), but I recently wrote a Rust implementation of the `makecab` program we use to compress them during the build, so it wouldn't be much work to write a cabextract-equivalent on top of that same codebase:
https://github.com/luser/rust-makecab

We would then have to have code to parse the PE binary format and locate the matching bytes of the text section given an address. There's lots of code floating around online to do that. Breakpad has some that uses Win32 APIs, and I know there's at least one Rust crate to parse PE. Handling relocations is a bit out of my range of knowledge, I assume we'd need a disassembler at that point? I've used capstone in the recent past with good results:
http://capstone-engine.org/
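For the "locate the matching bytes of the text section given an address" step, the core operation is translating an RVA (the address from the dump minus the module's base address) into an offset in the raw PE file via the section table. A minimal hand-rolled sketch of that translation (no PE crate, no relocation handling, only enough validation to show the idea):

```rust
/// Translate an RVA into an offset into the raw PE file using the section
/// table. Returns None if the RVA isn't backed by file data.
fn rva_to_file_offset(pe: &[u8], rva: u32) -> Option<usize> {
    let u16_at = |off: usize| u16::from_le_bytes([pe[off], pe[off + 1]]);
    let u32_at = |off: usize| u32::from_le_bytes([pe[off], pe[off + 1], pe[off + 2], pe[off + 3]]);

    // DOS header: e_lfanew at 0x3c points at the "PE\0\0" signature.
    let pe_off = u32_at(0x3c) as usize;
    assert_eq!(u32_at(pe_off), 0x0000_4550); // "PE\0\0", little-endian

    // COFF file header follows the signature; section headers follow the
    // optional header.
    let coff = pe_off + 4;
    let num_sections = u16_at(coff + 2) as usize;
    let opt_header_size = u16_at(coff + 16) as usize;
    let sections = coff + 20 + opt_header_size;

    for i in 0..num_sections {
        let s = sections + i * 40; // each IMAGE_SECTION_HEADER is 40 bytes
        let virt_size = u32_at(s + 8);
        let virt_addr = u32_at(s + 12);
        let raw_size = u32_at(s + 16);
        let raw_ptr = u32_at(s + 20);
        if rva >= virt_addr && rva < virt_addr + virt_size.max(raw_size) {
            let delta = rva - virt_addr;
            if delta < raw_size {
                return Some((raw_ptr + delta) as usize);
            }
            return None; // within the section but past the file-backed bytes
        }
    }
    None
}
```

Relocations, and the question of whether the dump actually captured enough bytes around the crash, would still need handling on top of this.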
This came up in the context of bug 1475772, where Cameron used windbg's !chkimg command to prove that the crashes were the result of bad RAM. It would be awesome to have this automated (either during processing, or on-demand) so that developers can easily check this without getting minidump access, finding a windows machine, and doing all the steps.

Just chatted with Lonnen, WillK is going to think about how this could be done.
Flags: needinfo?(willkg)
Making this a P2 to look into.

If anyone has updates on the state of things and anything that's changed since comment #12, I'd be much obliged.
Flags: needinfo?(willkg)
Priority: -- → P2
For a weekend project, I wrote a command line tool that basically does what chkimg does.

https://github.com/heycam/chkimg

I really don't know how the crash report processor environment works so I don't know how easily that could be integrated.  But I could turn that tool into a library crate easily enough.
Wow, that's awesome - thanks Cameron!

That both (a) simplifies the process for somebody to analyze dumps without a Windows machine (though they still need minidump access), and (b) hopefully makes it much more straightforward for willkg to plug this analysis into Socorro.
This is awesome! I need to finish up some other things and then I can get to looking at this.

Couple of things off the top of my head:

1. Is it possible for the output from chkimg to contain user data or information that could be exploited?

2. It's probably the case we need to change chkimg to output in JSON so Socorro can read it. Probably makes sense to add an output file as an argument.

I'll write up some issues in the chkimg issue tracker when I spend more time on this.

Backing up a bit, I want to throw together a "cupcake" of this with minimal work so we can start seeing what it looks like performance-wise and also numbers. We can run this on -stage for a while to figure things out. Then if we decide that we want to move forward, we can fix performance issues, make the code maintainable, and move it forward to prod.
(In reply to Will Kahn-Greene [:willkg] ET needinfo? me from comment #17)
> 1. Is it possible for the output from chkimg to contain user data or
> information that could be exploited?

I don't think the actual values are that interesting to most
people, so a summary like "chkimg has detected an N-bit
error in an instruction near the crash, which is likely caused
by faulty RAM" or something like that would be enough.  We can
provide the full output to the group that already has access
to the minidump, should it be needed.

It'd be awesome if we could also give this feedback to the crash
submitter (when possible) and provide advice on how to run
a memory check tool etc.
(In reply to Will Kahn-Greene [:willkg] ET needinfo? me from comment #17)
> 1. Is it possible for the output from chkimg to contain user data or
> information that could be exploited?

As Mats says we probably just want to know whether there was corruption rather than what the specific differences were.  (The differences printed would only be bytes from one of our DLLs, assuming you specify only our own symbol server on the command line, so that's not user specific data.)

> 2. It's probably the case we need to change chkimg to output in JSON so
> Socorro can read it. Probably makes sense to add an output file as an
> argument.
> 
> I'll write up some issues in the chkimg issue tracker when I spend more time
> on this.

Please needinfo me if you'd like me to make some changes to the tool to help, or feel free to file a PR.

> Backing up a bit, I want to throw together a "cupcake" of this with minimal
> work so we can start seeing what it looks like performance-wise and also
> numbers. We can run this on -stage for a while to figure things out. Then if
> we decide that we want to move forward, we can fix performance issues, make
> the code maintainable, and move it forward to prod.

Thanks for looking at this, I think it's going to save developers a lot of time.
That said... if it's a single-bit difference, which is probably the most common case of differences, it seems like it doesn't contain any obvious private data, and it might be interesting in terms of understanding how the crash occurred (and might sometimes be useful in distinguishing disk corruption from memory corruption by seeing whether it's consistent across reboots).  It could be more problematic in terms of data leakage if large numbers of bits were different (e.g., if part of the binary has been overwritten with something else on the user's disk).  On the other hand, writing that code is extra complexity...
If it doesn't add too much complexity, I'd suggest a count of the total number of mismatched bits across all the bytes that were checked. If a report shows a single-bit difference, we can with high confidence call it a hardware error and move on without even looking at the details. If the error is many bits, I would at least open the dump and check the page table flags on @ip to see if anyone had written to that region (injected code doing a hook or something).
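A minimal sketch of that classification step (plain Rust, nothing chkimg-specific assumed):

```rust
/// Count how many bits differ between the bytes captured in the dump and the
/// corresponding bytes from the shipped binary. A result of 1 strongly
/// suggests a hardware bit flip; a large count points more towards hooking or
/// wholesale overwriting.
fn mismatched_bits(from_dump: &[u8], from_binary: &[u8]) -> u32 {
    from_dump
        .iter()
        .zip(from_binary)
        .map(|(a, b)| (a ^ b).count_ones())
        .sum()
}
```

That single number would also be enough to drive the "N-bit error, likely faulty RAM" style summary Mats suggested, without exposing the raw bytes.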
Running this on a bunch of crashes shows some 1-bit corruptions (yeah! kinda), and other more systematic changes which I suspect are something unrelated to either instruction trashing or bad RAM.  Relocation?  Hooking by some AV DLL?  For example:

crashing IP: 0x7ff829b704c9
mismatch: 0x7ff8537ba040 .. 0x7ff8537ba04f (16 bytes) in ntdll.dll
  [ 48 b8 20 0a a3 8d f6 7f 00 00 ff e0 0d 00 00 00 ] should be [ 4c 8b d1 b8 0d 00 00 00 f6 04 25 08 03 fe 7f 01 ]

and the exact same "corruption" on multiple dumps for the same crash and same FF version, or

crashing IP: 0x612efe49
mismatch: 0x612efe34 .. 0x612efe35 (2 bytes) in xul.dll
  [ ac 63 ] should be [ c6 12 ]
mismatch: 0x612efe42 .. 0x612efe42 (1 byte) in xul.dll
  [ 0d ] should be [ 0f ]
mismatch: 0x612efff8 .. 0x612efff9 (2 bytes) in xul.dll
  [ 2f 61 ] should be [ 49 10 ]

or 

crashing IP: 0x7ffee98737e6
mismatch: 0x7ffeed330c32 .. 0x7ffeed330c35 (4 bytes) in xul.dll
  [ a2 ec fe 7f ] should be [ 53 84 01 00 ]
mismatch: 0x7ffeed330d02 .. 0x7ffeed330d05 (4 bytes) in xul.dll
  [ a2 ec fe 7f ] should be [ 53 84 01 00 ]
mismatch: 0x7ffeed330d72 .. 0x7ffeed330d75 (4 bytes) in xul.dll
  [ a9 ec fe 7f ] should be [ 5a 84 01 00 ]
mismatch: 0x7ffeed330d82 .. 0x7ffeed330d85 (4 bytes) in xul.dll
  [ 0c ed fe 7f ] should be [ bd 84 01 00 ]
[20 or 30 more like this]


or

crashing IP: 0x5a5603ab
mismatch: 0x5a560344 .. 0x5a560345 (2 bytes) in xul.dll
  [ 89 5c ] should be [ 7c 12 ]
mismatch: 0x5a5603a7 .. 0x5a5603a8 (2 bytes) in xul.dll
  [ 78 04 ] should be [ 6a 00 ]
mismatch: 0x5a5603aa .. 0x5a5603aa (1 byte) in xul.dll
  [ ff ] should be [ 01 ]
mismatch: 0x5a5603ac .. 0x5a5603ad (2 bytes) in xul.dll
  [ 94 33 ] should be [ 74 24 ]
mismatch: 0x5a5603af .. 0x5a5603b0 (2 bytes) in xul.dll
  [ 84 de ] should be [ 8b ce ]
mismatch: 0x5a5603b7 .. 0x5a5603b7 (1 byte) in xul.dll
  [ 1d ] should be [ f5 ]
mismatch: 0x5a5603b9 .. 0x5a5603b9 (1 byte) in xul.dll
  [ e8 ] should be [ ff ]
mismatch: 0x5a5603bc .. 0x5a5603bf (4 bytes) in xul.dll
  [ c3 d4 f8 40 ] should be [ 24 34 85 c0 ]
mismatch: 0x5e3baac9 .. 0x5e3baaca (2 bytes) in DWrite.dll
  [ 34 5e ] should be [ 4d 1e ]
mismatch: 0x5e3bab07 .. 0x5e3bab08 (2 bytes) in DWrite.dll
  [ 34 5e ] should be [ 4d 1e ]
mismatch: 0x5e3bab11 .. 0x5e3bab12 (2 bytes) in DWrite.dll
  [ 34 5e ] should be [ 4d 1e ]
mismatch: 0x5e3bab36 .. 0x5e3bab37 (2 bytes) in DWrite.dll
  [ 3f 5e ] should be [ 58 1e ]
mismatch: 0x5e3bab47 .. 0x5e3bab48 (2 bytes) in DWrite.dll
  [ 40 5e ] should be [ 59 1e ]
mismatch: 0x5e3bab6d .. 0x5e3bab6e (2 bytes) in DWrite.dll
  [ 34 5e ] should be [ 4d 1e ]
mismatch: 0x5e3bab78 .. 0x5e3bab79 (2 bytes) in DWrite.dll
  [ 34 5e ] should be [ 4d 1e ]
(In reply to David Major [:dmajor] from comment #22)
> If the error
> is many bits, I would at least open the dump and check the page table flags
> on @ip to see if anyone had written to that region (injected code doing a
> hook or something).

How do I check the page table flags?  Do these tools doing injection or whatever tend to just leave the page as writable after doing their dirty work?
Flags: needinfo?(dmajor)
(In reply to Cameron McCormack (:heycam) from comment #24)
> How do I check the page table flags?  Do these tools doing injection or
> whatever tend to just leave the page as writable after doing their dirty
> work?

Well, I was going to suggest the following, but it seems I'm wrong about this:

I thought that `!address @rip` would show a separate line when the current flags don't match the original allocation flags, so we could take advantage of the fact that writing to a DLL will drop its copy-on-write status:
Protect:                00000020          PAGE_EXECUTE_READ
Allocation Protect:     00000080          PAGE_EXECUTE_WRITECOPY

...but it looks like this is the case for untouched DLL pages too. :-( So you're right, this would only work if the injected code neglects to clean up after itself (which isn't unheard of, at least).

Caution: `!address` resets your exception context so you have to do `.ecxr` to get your state back.
Flags: needinfo?(dmajor)
But I guess you could `u` the mismatched bytes to see if they at least decode to valid instructions (especially a jmp or call).
Note that I just fixed a bug in chkimg which was resulting in a bunch of false positives -- I wasn't ignoring parts of the binary that were fixed up due to relocations!
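
For reference, that fix amounts to walking the .reloc section and treating any byte covered by a relocation entry as "don't care" during the comparison. A rough sketch of that walk (IMAGE_BASE_RELOCATION blocks; only the common x86/x64 entry types handled, and `reloc` is assumed to hold the raw bytes of the .reloc section):

```rust
/// Collect the RVA ranges patched by the loader so the comparison can skip
/// them. Each block starts with a page RVA and a block size, followed by u16
/// entries whose top 4 bits are the type and low 12 bits the page offset.
fn relocated_ranges(reloc: &[u8]) -> Vec<(u32, u32)> {
    let u16_at = |off: usize| u16::from_le_bytes([reloc[off], reloc[off + 1]]);
    let u32_at = |off: usize| u32::from_le_bytes([reloc[off], reloc[off + 1], reloc[off + 2], reloc[off + 3]]);

    let mut ranges = Vec::new();
    let mut block = 0usize;
    while block + 8 <= reloc.len() {
        let page_rva = u32_at(block);
        let block_size = u32_at(block + 4) as usize;
        if block_size < 8 {
            break; // padding / malformed block
        }
        let end = (block + block_size).min(reloc.len());
        for entry_off in ((block + 8)..end).step_by(2) {
            if entry_off + 2 > end {
                break;
            }
            let entry = u16_at(entry_off);
            let kind = entry >> 12;
            let offset = (entry & 0x0fff) as u32;
            let len = match kind {
                3 => 4,  // IMAGE_REL_BASED_HIGHLOW: 32-bit absolute fixup (x86)
                10 => 8, // IMAGE_REL_BASED_DIR64: 64-bit absolute fixup (x64)
                _ => 0,  // ABSOLUTE padding entries and anything else: skip
            };
            if len > 0 {
                ranges.push((page_rva + offset, page_rva + offset + len));
            }
        }
        block += block_size;
    }
    ranges
}
```

The diffing code would then ignore any byte whose RVA falls inside one of these ranges.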
I pulled down the latest chkimg today and ran it on 100 Windows crash reports from today for 62.0.3. Of those 100, 65 printed "no errors found" and the other 35 didn't. Most of them look like mismatches, but three look like this:

>>> Working on crashdata/v1/dump/57d50768-6d72-4a62-94c6-d19f10181003...
PE failure: ResolveMapError

>>> Working on crashdata/v1/dump/73f94540-d841-4905-b7c2-965ca0181003...
PE failure: ResolveMapError

>>> Working on crashdata/v1/dump/cecc6f7e-36b5-41a6-836c-2c4780181003...
PE failure: ResolveMapError


(The "Working on..." output is from my script.)

I have a few questions.

First, I sort of understand what this utility is doing, but not enough to verify correctness over a larger set of crashes. It doesn't look like there are tests. How do we verify that this is working correctly?

Second, PE stands for Portable Executable and I sort of understand what that is, but beyond that I don't know what a PE failure might be here.

Third, 32 out of the 100 crash reports I looked at have mismatches--does that seem plausible to you? Assuming chkimg is working correctly, does that suggest that 1/3 of the crashes I looked at are from computers with bad memory?
One related metric that I'd find useful here is to auto classify if the current instruction can even fault. I often see crashes on things like |cmp rax, rdx| which have been traced to bad memory when I've engaged with users. Probably way out of scope for this bug, but might be a future extension for chkimg.

When I last looked in-depth at JIT crashes I saw at least 1/3 of crashes that were clearly bit-flips. (The numbers were a bit skewed by known-bad-CPU-silicon.) Some of them manifested as bad data in the minidump like this bug looks at, and some of them showed the correct machine code, but the observed fault only makes sense if a single bit was flipped in the crashing instruction, given the register and memory state.

This is interesting work, but needs more people interested in working on it to move it forward. For now, I'm going to bump it down to P3.

Priority: P2 → P3

Hey Cam! Will had a few questions in comment 28: https://bugzilla.mozilla.org/show_bug.cgi?id=1274628#c28

Do you have any insights?

Flags: needinfo?(cam)

Sorry that I overlooked this needinfo. In case it's still relevant:

> First, I sort of understand what this utility is doing, but not enough to verify correctness over a larger set of crashes. It doesn't look like there are tests. How do we verify that this is working correctly?

Tests would need to be written.

> Second, PE stands for Portable Executable and I sort of understand what that is, but beyond that I don't know what a PE failure might be here.

That's printing out the error that came from the pe crate I'm using:

https://docs.rs/pe/0.1.1/pe/enum.Error.html#variant.ResolveMapError

Perhaps it means that my code is trying to read out of range bytes from the DLL for some reason.

> Third, 32 out of the 100 crash reports I looked at have mismatches--does that seem plausible to you? Assuming chkimg is working correctly, does that suggest that 1/3 of the crashes I looked at are from computers with bad memory?

That seems like too many! It's possible there are some kinds of relocations I wasn't handling correctly. To verify my code is doing the right thing, I would compare to running !chkimg in WinDbg.

Flags: needinfo?(cam)