Closed Bug 1788174 Opened 2 years ago Closed 2 years ago

Crash in [@ BaseThreadInitThunk]

Categories

(Core :: mozglue, defect)

Platform: Unspecified
OS: Windows 10

Tracking


RESOLVED FIXED
Tracking Status
firefox-esr91 --- unaffected
firefox-esr102 --- unaffected
firefox104 --- unaffected
firefox105 --- unaffected
firefox106 blocking fixed

People

(Reporter: aryx, Unassigned)

References

Details

(Keywords: crash)

Crash Data

10 crashes from 8 Windows 10 installations, all with the latest Nightly (106.0a1 20220830210405). There was an isolated crash report with this signature for Nightly in July. We still see this crash signature occasionally for release and rarely for Nightly.

Michael, could you take a look if this could be related to the WebRTC update (bug 1766646 etc.)? For the record, the other changes in this Nightly are https://hg.mozilla.org/mozilla-central/pushloghtml?changeset=ecb328de1aafc36765b3bbf7f434ef84d93cad28

See bug 1740627 for a former instance of this signature which got fixed.

Crash report: https://crash-stats.mozilla.org/report/index/2ef9ff48-a7cc-4843-8781-26e180220831

Reason: EXCEPTION_STACK_BUFFER_OVERRUN / FAST_FAIL_GUARD_ICALL_CHECK_FAILURE

Top 8 frames of crashing thread:

0 ntdll.dll LdrpICallHandler 
1 ntdll.dll RtlpExecuteHandlerForException 
2 ntdll.dll RtlDispatchException 
3 ntdll.dll KiUserExceptionDispatch 
4 ntdll.dll LdrpDispatchUserCallTarget 
5 kernel32.dll BaseThreadInitThunk 
6 mozglue.dll patched_BaseThreadInitThunk toolkit/xre/dllservices/mozglue/WindowsDllBlocklist.cpp:581
7 ntdll.dll RtlUserThreadStart 
Flags: needinfo?(mfroman)
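[Editorial note] For context: FAST_FAIL_GUARD_ICALL_CHECK_FAILURE is a Control Flow Guard (CFG) violation. With CFG enabled, every indirect call is vetted (dispatched through LdrpDispatchUserCallTarget, visible in frame 4); if the call target is not a known-valid entry point, ntdll raises a fast-fail that terminates the process and cannot be caught by an exception handler. Frame 6 (patched_BaseThreadInitThunk) is the mozglue hook in WindowsDllBlocklist.cpp that wraps kernel32's BaseThreadInitThunk to vet thread start addresses, so the failing indirect call happens while a new thread is being started. A minimal sketch of the fast-fail mechanism, not Mozilla code:

    #include <windows.h>  // FAST_FAIL_GUARD_ICALL_CHECK_FAILURE (value 10)
    #include <intrin.h>   // __fastfail

    // Sketch only: how a CFG violation surfaces as
    // EXCEPTION_STACK_BUFFER_OVERRUN / FAST_FAIL_GUARD_ICALL_CHECK_FAILURE.
    void SimulateCfgFastFail() {
      // Raises STATUS_STACK_BUFFER_OVERRUN (0xC0000409) and terminates the
      // process immediately; no exception handler can intercept it.
      __fastfail(FAST_FAIL_GUARD_ICALL_CHECK_FAILURE);
    }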

The Bugbug bot thinks this bug should belong to the 'Core::mozglue' component, and is moving the bug to that component. Please correct it if you think the bot is wrong.

Component: Untriaged → mozglue
Product: Firefox → Core

The bug is marked as tracked for firefox106 (nightly). However, the bug still isn't assigned.

:Sylvestre, could you please find an assignee for this tracked bug? If you disagree with the tracking decision, please talk with the release managers.

For more information, please visit auto_nag documentation.

Flags: needinfo?(sledru)

Redirecting to the managers who know much more than I do about the init system
(not sure that bugbug is correct on the component)

Flags: needinfo?(sledru)
Flags: needinfo?(haftandilian)
Flags: needinfo?(gpascutto)

Given the volume, I guess it is a new issue.

The previous instance was a hardening against badly written injections. However, here I see reports (at least on Nightly) with no obvious third-party modules and with large uptimes (the latter is not necessarily inconsistent, as the crash happens on thread launch):

https://crash-stats.mozilla.org/report/index/60b82f2c-c17c-40ab-abc6-310230220831
https://crash-stats.mozilla.org/report/index/3e91f2ac-b3fd-49d2-9d9c-2182c0220831
https://crash-stats.mozilla.org/report/index/043c2de5-5718-46e0-be71-08b070220831
https://crash-stats.mozilla.org/report/index/655295c5-33b8-4d24-8362-5289e0220831

Unfortunately there are no correlations available on crash-stats.

There is no relation between this crash and the changes in the regression range, so either we crash due to earlier stack bustage elsewhere, or this is third-party stuff anyway.

Not sure this will be very actionable.

Flags: needinfo?(gpascutto)
Severity: S2 → S1
Crash Signature: [@ BaseThreadInitThunk] → [@ BaseThreadInitThunk] [@ patched_BaseThreadInitThunk]
Priority: -- → P1
Crash Signature: [@ BaseThreadInitThunk] [@ patched_BaseThreadInitThunk] → [@ BaseThreadInitThunk] [@ patched_BaseThreadInitThunk] [@ ntdll.dll | BaseThreadInitThunk ]
Flags: needinfo?(haftandilian)
See Also: → 1713160

I don't have anything to add here. Can we tell what thread was being started from the stacks of other threads and see if there's any commonality?

Can we tell what thread was being started from the stacks of other threads and see if there's any commonality?

No. Looking at the active thread (if any) also doesn't show any obvious commonality to me.

I notice that these reports don't seem to have memory information attached. Is this because EXCEPTION_STACK_BUFFER_OVERRUN / FAST_FAIL_GUARD_ICALL_CHECK_FAILURE is a WER-caught error and we don't have that info there?

Jim, this is P1/S1 now and the WebRTC update is in the regression range. There's no other obvious (or even not so obvious?) change that can cause this, and we have very limited info to go on here, so making sure this is high on your radar.

Flags: needinfo?(jmathies)

Additionally, I was wondering if these could correlate with OOM reduction (bug 1716727), but that was first enabled in 105 and the majority of these crashes are in 106.

Several user comments mention that the crash happens when you close a tab which had Netflix playing:
"Crash when closing the tab while playing Netflix "
"Crashes have occurred a few times, when closing a Tab with Netflix open. "
"If I close the tab while playing any series or movie from the Netflix site, the application crashes."

Do we know if there's been any CDM updates in that timeframe?

This is a reminder regarding comment #2!

The bug is marked as blocking firefox106 (nightly). We have limited time to fix this; the soft freeze is in 9 days. However, the bug still isn't assigned.

(In reply to Gian-Carlo Pascutto [:gcp] from comment #12)

Do we know if there's been any CDM updates in that timeframe?

We haven't updated the CDM in this time frame. We do have a major libwebrtc update in this time frame, as well as a security-related fix in media code. Diagnosing this is pretty tough, though; there's nothing of much value in the stacks.

Flags: needinfo?(mfroman)
Flags: needinfo?(jmathies)

(In reply to Jim Mathies [:jimm] from comment #14)

(In reply to Gian-Carlo Pascutto [:gcp] from comment #12)

Do we know if there's been any CDM updates in that timeframe?

We haven't updated the CDM in this time frame. We do have a major libwebrtc update in this time frame, as well as a security-related fix in media code. Diagnosing this is pretty tough, though; there's nothing of much value in the stacks.

Have we tried reproducing the crash by closing a tab where Netflix is playing?

Flags: needinfo?(jmathies)

Exception thrown at 0x00007FF9FA3D3B6E (ntdll.dll) in firefox.exe: 0xC0000005: Access violation reading location 0xFFFFFFFFFFFFFFFF.

This is easy to reproduce. Open any netflix stream and close the tab after it starts playing. Will try to find a regression range.

Flags: needinfo?(jmathies)

(In reply to Jim Mathies [:jimm] from comment #16)

Exception thrown at 0x00007FF9FA3D3B6E (ntdll.dll) in firefox.exe: 0xC0000005: Access violation reading location 0xFFFFFFFFFFFFFFFF.

This is easy to reproduce. Open any netflix stream and close the tab after it starts playing. Will try to find a regression range.

Hmm, and now I can't reproduce in the same nightly version.

I'm not having any luck reproducing this reliably, which is preventing generation of a regression range.

Flags: qe-verify?

This crash signature might be related - https://bugzilla.mozilla.org/show_bug.cgi?id=1788592

This is a reminder regarding comment #2!

The bug is marked as blocking firefox106 (nightly). We have limited time to fix this; the soft freeze is in 8 days. However, the bug still isn't assigned.

"closing the tab while playing Netflix" is pretty consistent with the GMP process shutting down. Bug 1788592 has _exit(0) in the GMP process on the stack, so I wouldn't be surprised if they are related.

The regression here appears to be fixed by https://bugzilla.mozilla.org/show_bug.cgi?id=1788592

I'm not going to dupe this since the signature has been around for a while. Will leave it to release drivers to decide what to do with this bug.

Hey gcp, any chance your Windows experts can comment on this? We addressed this by preloading oleaut32.dll in the GMP process, but we really don't understand what triggered the issue in the first place.

Flags: needinfo?(gpascutto)
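[Editorial note] For reference, the fix that landed in bug 1788592 amounts to loading oleaut32.dll in the GMP child process while it can still read files, i.e. before sandbox lockdown. A hedged sketch of the shape of that fix (the function name is illustrative, not the actual Mozilla code):

    #include <windows.h>

    // Illustrative only: map oleaut32.dll while the sandboxed GMP process
    // can still read files from disk, so the delay-load of VariantClear at
    // process exit resolves against the already-mapped module instead of
    // attempting (and failing) a LoadLibrary under the sandbox.
    void PreloadOleaut32BeforeSandboxLockdown() {
      ::LoadLibraryW(L"oleaut32.dll");
    }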

I'm going to assume relman will be able to drop the severity and we don't need to immediately figure out who can dive more deeply into this. But we'll get to this.

Flags: needinfo?(gpascutto)

We stopped crashing on Nightly after the patch in bug 1788592 landed, so I am marking this fixed for the 106 release, lowering the severity, and removing the P1 flag.

Severity: S1 → S2
Priority: P1 → --

To summarize findings from bug 1788592:

  • These crashes were a consequence of statically linking the library comsupp.lib when building xul.dll. oleaut32.dll is an implicit dependency used at exit when comsupp.lib is linked, and we specify that oleaut32.dll should be delay-loaded when building xul.dll. Exiting the plugin-container.exe process calls the dynamic atexit destructor for 'vtMissing' from comsupp.lib, which calls the delay-import for VariantClear from oleaut32.dll; that triggers an attempt to load oleaut32.dll, which fails because the sandbox is active and we cannot read the library file on disk (see the illustrative sketch after this list).
  • There are two possible fixes: (1) keep pre-loading oleaut32.dll, which is the current fix, or (2) stop using comsupp.lib in xul.dll and add continuous-integration checks that we don't statically link comsupp.lib when building xul.dll (see bug 1788592 for details).
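[Editorial note] An illustrative reconstruction of that failure chain, assuming the standard comdef.h/_variant_t machinery and a binary built with /DELAYLOAD:oleaut32.dll (this is not the actual Mozilla or comsupp.lib source):

    #include <comdef.h>  // _variant_t; comdef.h auto-links comsupp[w].lib

    // Hypothetical stand-in for comsupp.lib's global vtMissing. This
    // particular constructor sets the VARIANT fields inline, so building
    // the global makes no call into oleaut32.dll; the delay-loaded DLL
    // stays unmapped.
    _variant_t gMissingLike(DISP_E_PARAMNOTFOUND, VT_ERROR);

    int main() {
      // Nothing here touches oleaut32.dll either...
      return 0;
      // ...but at process exit the dynamic atexit destructor ~_variant_t()
      // runs and calls VariantClear(). With /DELAYLOAD:oleaut32.dll that
      // first call goes through the delay-load helper, which calls
      // LoadLibrary at exit; in the sandboxed plugin-container.exe the
      // library file is no longer readable, the load fails, and the process
      // crashes instead of exiting cleanly.
    }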

Based on the topcrash criteria, the crash signatures linked to this bug are not in the topcrash signatures anymore.

For more information, please visit auto_nag documentation.

Keywords: topcrash

I think we should close out this bug and reopen a new one for the crashes that remain. They seem unrelated to the webrtc problem that was fixed here.

Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED