Closed Bug 1603474 Opened 4 years ago Closed 4 years ago

[Intermittent] Broken connectivity on herokuapp.com

Categories

(Core :: WebRTC: Networking, defect, P2)

Tracking

RESOLVED INCOMPLETE
Tracking Status
firefox72 --- affected
firefox73 --- affected

People

(Reporter: asoncutean, Assigned: dminor)

References

(Blocks 2 open bugs)

Attachments

(2 files, 1 obsolete file)

[Affected versions]:

  • Fx 72.0b5
  • Fx 73.0a1

[Affected platforms]:

  • I don’t see a general pattern here: the issue appears across different platform/browser combinations (Windows 10, Windows 7, Ubuntu 18.04, macOS 10.15 / Firefox, Chrome, Safari), regardless of which end initiates the call and which end receives it.

[Steps to reproduce]:

  1. Open https://evening-thicket-98446.herokuapp.com/src/content/peerconnection/filetransfer-b2b/#callingId=192378 on one end
  2. Copy and paste the same link on the other end
  3. Initiate a call
  4. Observe the Share button

[Expected result]:

  • The Share button is active

[Actual result]:

  • The Share button is inactive

[Regression range]:

  • Not sure if we can determine one since the issue is intermittent, but I will give it a try asap.

[Additional notes]:

  • The issue still occurs intermittently with media.peerconnection.ice.obfuscate_host_addresses set to false

I'm not able to reproduce this on a MBP with Nightly on one end and Chrome on the other.

Do you see any error messages in web console or browser console when the button is inactive? On either end? Any indication of failures in about:webrtc?

Are you only seeing this problem in Firefox?

Flags: needinfo?(anca.soncutean)
Attached image share not active.png

OK, so I reproduced it on the second try, from Windows 10 [72.0b5] to macOS 10.15 [Chrome]. I'll attach a screenshot of the browser console here.

Flags: needinfo?(anca.soncutean)
Attached image webrtc.png

And here is a screenshot with the data from about:webrtc page on Windows 10's side.

What I find strange is that Firefox never seems to use the srflx candidate even though it shows up in the local sdp. So if something goes wrong with the host candidate, then the connection fails. From comment#0, this seems to happen occasionally even if the mDNS stuff is disabled.
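To check which candidate types actually appear in a local SDP, a minimal sketch like the following can list the `a=candidate` lines and flag obfuscated (mDNS) host entries. It assumes the standard ICE candidate attribute layout (foundation, component, transport, priority, address, port, `typ`, type). The sample SDP, its addresses, and its priorities are made up for illustration and are not taken from the attached screenshots.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    foundation: str
    component: int
    transport: str
    priority: int
    address: str
    port: int
    typ: str  # "host", "srflx", "prflx" or "relay"

    @property
    def is_mdns(self) -> bool:
        # Obfuscated host candidates carry a ".local" mDNS name instead of an IP.
        return self.address.endswith(".local")


def parse_candidates(sdp: str) -> List[Candidate]:
    """Extract the a=candidate lines from an SDP blob."""
    prefix = "a=candidate:"
    found = []
    for line in sdp.splitlines():
        line = line.strip()
        if not line.startswith(prefix):
            continue
        fields = line[len(prefix):].split()
        # Expected order: foundation component transport priority
        # address port "typ" type [extensions...]
        if len(fields) < 8 or fields[6] != "typ":
            continue
        found.append(
            Candidate(
                foundation=fields[0],
                component=int(fields[1]),
                transport=fields[2].upper(),
                priority=int(fields[3]),
                address=fields[4],
                port=int(fields[5]),
                typ=fields[7],
            )
        )
    return found


# Hypothetical SDP fragment: one obfuscated host candidate, one srflx candidate.
SAMPLE_SDP = """\
v=0
a=candidate:0 1 UDP 2122252543 3f2a9c1e-6d0b-4f0e-9a55-0c1d2e3f4a5b.local 50123 typ host
a=candidate:1 1 UDP 1686052863 203.0.113.7 50123 typ srflx raddr 0.0.0.0 rport 0
"""

candidates = parse_candidates(SAMPLE_SDP)
types = [c.typ for c in candidates]
```

Comparing this list against the pairs shown in about:webrtc makes it easier to see whether a gathered srflx candidate ever ends up in a candidate pair.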

When testing, you have to be very careful to only have two systems using the same callingId. I thought I had it reproducing 100% of the time between Firefox on Windows 10 and Chrome on OS X only to realize I had left a Chrome instance open on the Windows 10 machine using the same callingId.

There are a few timing things that might be at fault here. We shouldn't fail ICE if we have a pending mDNS query, but that might be happening. We might need to extend the ICE trickle timeout period to allow for time to resolve mDNS addresses. We might need to allow more time in the mDNS resolver itself. We might need to see why we're never using the local srflx address.
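The first two timing ideas above can be sketched as a single decision function: do not declare ICE failed while a check is still running, and extend the deadline while an mDNS query is outstanding. This is only an illustration of the idea, not how Firefox's timers are implemented. The names `IceState` and `should_fail_ice` and the timeout values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class IceState:
    elapsed_s: float           # time since checks started
    active_pairs: int          # pairs still being checked
    pending_mdns_queries: int  # unresolved ".local" lookups


def should_fail_ice(state: IceState,
                    trickle_grace_s: float = 5.0,
                    mdns_extra_s: float = 3.0) -> bool:
    """Return True only when nothing can still produce a working pair.

    Both timeout values are placeholders chosen for the example.
    """
    if state.active_pairs > 0:
        # A connectivity check is still running, so keep waiting.
        return False
    deadline = trickle_grace_s
    if state.pending_mdns_queries > 0:
        # A remote mDNS name may still resolve into a usable address,
        # so allow extra time before giving up.
        deadline += mdns_extra_s
    return state.elapsed_s >= deadline


fails_without_pending = should_fail_ice(IceState(6.0, 0, 0))
fails_with_pending = should_fail_ice(IceState(6.0, 0, 1))
fails_after_extension = should_fail_ice(IceState(9.0, 0, 1))
```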

This didn't show up when QA was testing with Nightly a few weeks ago, but both Anca and I have reproduced it with Nightly builds from that timeframe. So it's possible that Chrome has tightened up their timing and that is why we're seeing this problem now.

Assignee: nobody → dminor

It looks to me like Firefox will never pair a srflx or prflx address [1]. So what we're seeing above is that the host candidate is failing for one reason or another and the connection fails. If I simulate this by commenting out host candidate generation here [2], the connection fails every single time. If I remove the "goto: done" from [1], it will succeed using a srflx or prflx address.

There are likely some improvements to be made to the timing of the mDNS stuff to make failures on the local network less likely, but mDNS host candidates are never going to work across network boundaries, so it seems like we need to be using reflex candidates if the host candidates fail.

:bwc, do you know why we don't pair srflx and prflx addresses? Is there somewhere else where reflex candidates should be used that we're somehow missing with the mDNS stuff enabled? Thanks!

[1] https://searchfox.org/mozilla-central/rev/2f09184ec781a2667feec87499d4b81b32b6c48e/media/mtransport/third_party/nICEr/src/ice/ice_component.c#1082
[2] https://searchfox.org/mozilla-central/rev/2f09184ec781a2667feec87499d4b81b32b6c48e/media/mtransport/third_party/nICEr/src/ice/ice_component.c#239

Flags: needinfo?(docfaraday)

The thought is that if we have a prflx or srflx, we have already paired the host candidate that they are based on, because we don't want to create redundant pairs. But I guess if we fail to gather the host candidate we end up in this weird situation where we can send packets and get srflx/prflx without ever learning what the host candidate is, and so there's never a pair.

Flags: needinfo?(docfaraday)

Under normal circumstances reflex candidates are not paired because they will
be redundant with host candidates. When hostname obfuscation is used, we can
get in a situation where host candidates fail but reflex candidates will
succeed so it makes sense to pair them in this case.
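
The pairing behaviour discussed in this bug can be illustrated with a toy model. This is not the nICEr code. The function `form_pairs`, its `pair_reflexive_if_no_host` flag, and the addresses are hypothetical, and the missing host candidate stands in for the experiment where host candidate generation was commented out.

```python
from typing import List, Tuple

Cand = Tuple[str, str]  # (candidate type, address)

REFLEXIVE = {"srflx", "prflx"}


def form_pairs(local: List[Cand],
               remote: List[Cand],
               pair_reflexive_if_no_host: bool = False) -> List[Tuple[Cand, Cand]]:
    """Build (local, remote) candidate pairs under a simplified rule."""
    have_host = any(typ == "host" for typ, _ in local)
    pairs = []
    for lcand in local:
        if lcand[0] in REFLEXIVE:
            # Normally skipped: a pair built from a reflexive candidate is
            # redundant with the pair built from its host candidate.
            if have_host or not pair_reflexive_if_no_host:
                continue
        for rcand in remote:
            pairs.append((lcand, rcand))
    return pairs


remote = [("host", "198.51.100.4")]

# Host candidate missing: only a srflx candidate was gathered.
local_no_host = [("srflx", "203.0.113.7")]
old_behaviour = form_pairs(local_no_host, remote)
new_behaviour = form_pairs(local_no_host, remote, pair_reflexive_if_no_host=True)

# Host candidate present: the reflexive candidate stays unpaired either way.
local_with_host = [("host", "192.0.2.10"), ("srflx", "203.0.113.7")]
normal_case = form_pairs(local_with_host, remote, pair_reflexive_if_no_host=True)
```

In this model the old rule produces no pairs at all when the host candidate is missing, which matches the failure described above, while the relaxed rule yields one srflx pair and leaves the normal case unchanged.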

Has Regression Range: --- → no

It looks like this issue is not a regression: I’ve reproduced it as far back as Fx 45.0a1 (older builds are either broken or the site is not functional on them). The “Share” button remains inactive regardless of which end initiates the call (Firefox or Chrome).

Has Regression Range: no → ---
Attachment #9115902 - Attachment is obsolete: true
Priority: -- → P2

Based on rise in prevalence of symptoms—which seems relevant to determining severity—should this perhaps be viewed as a regression from the introduction of mDNS concealment of host candidates?

(In reply to Jan-Ivar Bruaroey [:jib] (needinfo? me) from comment #9)

Based on rise in prevalence of symptoms—which seems relevant to determining severity—should this perhaps be viewed as a regression from the introduction of mDNS concealment of host candidates?

Well, this is already marked as blocking the mDNS meta bug. I'm not even sure we can say there is a rise in prevalence of symptoms: this site was used for testing mDNS for Nightly 72 and no problems were noticed at that time.

One thing that just occurred to me is that we have the network/socket process enabled on Firefox Nightly but, as far as I know, not on Beta, and perhaps that is why things are behaving differently. I've been doing most of my testing on Nightly and have not seen consistent problems.

Bugbug thinks this bug is a regression, but please revert this change in case of error.

Keywords: regression
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → INCOMPLETE