Closed Bug 1608114 Opened 4 years ago Closed 4 years ago

Delay of ~20ms between receiving a DoH response and sending a SYN packet

Categories

(Core :: Networking, defect, P1)

defect

Tracking

()

RESOLVED WORKSFORME
mozilla74
Tracking Status
firefox74 --- wontfix

People

(Reporter: mayhemer, Assigned: valentin)

References

Details

(Whiteboard: [necko-triaged][trr])

Attachments

(4 files, 1 obsolete file)

Reported to me on slack indirectly by Bas:

'I've seen this demonstrated by looking at dev tools during a DoH based HTTPS connection and correlating it with a packet trace decoded with the exported ephemeral key. Let's assume the dev tools report N msec of DNS resolution time. If you look at the traces there is indeed roughly N msec in between when the DoH request is sent and when the TCP connection to the origin (using the DNS response) is made.. However, looking at the capture, the DoH response is received significantly before the TCP connection is started.. when using legacy DNS there is not a significant gap between the DNS response and the initiation of the TCP to the origin. I'm sorry I' don't have better data, but anecdotally this delay was often in the 20+ms range.'

So, it looks like it either takes more time to process the DoH response (because we need to go to the main thread?) or there is even more complex bug (not likely IMO).

Our tooling is insufficient to look at this easily, so I'm filing a bug to log this.

Assignee: nobody → valentin.gosu
Priority: P2 → P1
Whiteboard: [necko-triaged] → [necko-triaged][trr]
  • Enables timing for TRR channels and records telemetry with time elapsed
    between the responseEnd and when the listener is notified with the
    nsHostRecord object

So, using the attached patch I ran a try build on a clean profile with TRR.
With very low activity (just opening a new tab), you rarely see more than 15-20ms delay.
But my run - opening fast.com in one tab, then opening lots of new pages at the same time gives us the results below.

I have a profile here: https://perfht.ml/3a7cHSY

DNS_TRR_PROCESSING_TIME
164 samples, average = 6.6, sum = 1,083

  0 |#########################  123  75%
  1 |##  11  7%
  2 |  2  1%
  3 |#  4  2%
  4 |  1  1%
  5 |#  3  2%
  9 |  1  1%
 12 |  1  1%
 20 |  1  1%
 23 |##  9  5%
 33 |  1  1%
 37 |  1  1%
 84 |  1  1%
120 |#  5  3%
135 |  0  0%
Attached file data-request.md
Attachment #9120729 - Flags: data-review?(chutten)
Comment on attachment 9120729 [details]
data-request.md

I was following this bug so I'll go ahead and take the data-steward review.

This is approved as-is, but please consider whether there's really a need for an indefinite collection; if we identify the root cause of the timing issue, will we continue to monitor this probe after we fix it? A renewable time-limited collection could be a good alternative.

--

1) Is there or will there be **documentation** that describes the schema for the ultimate data set in a public, complete, and accurate way?

Yes, in Histograms.json and the probe dictionary.

2) Is there a control mechanism that allows the user to turn the data collection on and off?

Yes, Firefox users may [opt out of telemetry](https://support.mozilla.org/en-US/kb/share-data-mozilla-help-improve-firefox).

3) If the request is for permanent data collection, is there someone who will monitor the data over time?

Yes, vgosu.

4) Using the **[category system of data types](https://wiki.mozilla.org/Firefox/Data_Collection)** on the Mozilla wiki, what collection type of data do the requested measurements fall under?

Category 1, technical data.

5) Is the data collection request for default-on or default-off?

Default-on.

6) Does the instrumentation include the addition of **any *new* identifiers**?

No.

7) Is the data collection covered by the existing Firefox privacy notice?

Yes.

8) Does there need to be a check-in in the future to determine whether to renew the data?

n/a; permanent collection.

9) Does the data collection use a third-party collection tool?

No.
Attachment #9120729 - Flags: data-review?(chutten) → data-review+

(In reply to Tim Smith 👨‍🔬 [:tdsmith] from comment #4)

Comment on attachment 9120729 [details]
data-request.md

I was following this bug so I'll go ahead and take the data-steward review.

Thanks!

This is approved as-is, but please consider whether there's really a need
for an indefinite collection; if we identify the root cause of the timing
issue, will we continue to monitor this probe after we fix it? A renewable
time-limited collection could be a good alternative.

That depends on the fix. If we'll have a fix that ensures a small processing delay and is not affected by the browser's workload, then we can switch to a time limited probe (even though it might be worth keeping it to ensure we don't regress in the future).
However, if our fix for this is running on a thread that also processes other things, we probably want to keep it as-is.

Pushed by valentin.gosu@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/8963f5a745f0
Diagnose ~20ms delay processing a DoH response r=dragana
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla74

We have landed only a diagnostic patch, not an actual fix.

Status: RESOLVED → REOPENED
Keywords: leave-open
Resolution: FIXED → ---
Status: REOPENED → ASSIGNED

This is a speculative fix for the TRR delay.
It retargets delivery of OnDataAvailable to a dedicated TRR thread and it
attempts to process the response in the last OnDataAvailable callback.
This should avoid the busier main thread and make sure we process the
response a bit sooner.

After discussing this with Dragana it was clear why this patch doesn't completely fix the problem: even if the socket has received all the data, it still needs to send OnStartRequest which happens on the main thread - so we're still waiting for the main thread events to be scheduled.
The patch offers only a marginal improvement in moving onDataAvailable and the response processing off the main thread. It seems we need something like bug 1528285 to avoid this issue completely.

(In reply to Valentin Gosu [:valentin] (he/him) from comment #10)

After discussing this with Dragana it was clear why this patch doesn't completely fix the problem: even if the socket has received all the data, it still needs to send OnStartRequest which happens on the main thread - so we're still waiting for the main thread events to be scheduled.
The patch offers only a marginal improvement in moving onDataAvailable and the response processing off the main thread. It seems we need something like bug 1528285 to avoid this issue completely.

Yeah, I was afraid of that. Also, there is the proxy resolution callback. What about setting the priorities? Have you considered that or is it not available on the parent process?

See Also: → 1612118

Smaug, bas: is HIGH the right priority to use here? As opposed to INPUT_HIGH or MEDIUM_HIGH. I suspect this runs for only a short time, which helps with the idea of making it HIGH. Valentin, how long could this event run/block other events? << 1ms?

Flags: needinfo?(valentin.gosu)
Flags: needinfo?(bugs)
Flags: needinfo?(bas)

mediumhigh is the right one here.

high is really for vsync and has some special magic around it (like, it let's one non-high to be process per each high), and input_* is really for input events only.
we should rename high to vsync I think.

Flags: needinfo?(bugs)

(And what has been discussed occasionally, and what would be useful here I think, is some "control queue" for passing some state or other simple thing through main thread. Such queue should have the highest possible priority always. But we don't have that atm)

(In reply to Randell Jesup [:jesup] (needinfo me) from comment #13)

Valentin, how long could this event run/block other events? << 1ms?

Definitely less than 1ms. The TRR onStart/data/stop don't do very much at all. But I'll make them mediumhigh anyway.

Flags: needinfo?(valentin.gosu)
Pushed by valentin.gosu@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/0c429fa8285b
Increase the priority of the TRR main thread events r=mayhemer

The last patch seems to have made things a little better on my machine, but it's difficult to quantify it.
It increases the priority of events being dispatched to the main thread.
I plan to also land the remaining patch - which moves OnDataAvailable and the response processing off the main thread.
Andrew, could check if you see any improvements compared to prior builds?
Thanks!

Flags: needinfo?(acreskey)

It looks like that patch landed in 20200131213333. Looking at https://colab.research.google.com/drive/1ksUeWjz9Fx7xegoKEHuShcvzbvyviTRn I don't see a convincing change in any of

  • the fraction of dns_trr_processing_time measurements greater than zero
  • the median user's geometric mean of dns_trr_processing_time measurements greater than zero
  • the median user's arithmetic mean of all dns_trr_processing_time measurements

comparing the before and after builds.

Valentin, as far as I can see, none of the builds of Fenix or Geckoview_example have DoH enabled. (I'm looking at the pref, network.trr.mode, which is set to 0)

Flags: needinfo?(acreskey)

(In reply to Andrew Creskey [:acreskey] [he/him] from comment #21)

Valentin, as far as I can see, none of the builds of Fenix or Geckoview_example have DoH enabled. (I'm looking at the pref, network.trr.mode, which is set to 0)

You're right, we don't enable the feature for mobile. Honza is working on a similar approach in bug 1612118 which should also apply to mobile.
Unfortunately, from Tim's analysis in comment 20 it seems this didn't really fix the root cause.

By all indications it seems bug 1615335 has fixed this issue by moving TRR requests to a background thread!
We can see the telemetry for going down (then slightly up for a few days when we turned off the feature) to 0. My local telemetry shows all requests completed with less than 1ms (ie 0ms) delay, which is what we want to see.

Most of the implementation was done by Kershaw, so I want to acknowledge the awesome work he did.

Flags: needinfo?(bas)
Status: ASSIGNED → RESOLVED
Closed: 4 years ago4 years ago
Keywords: leave-open
Resolution: --- → WORKSFORME
See Also: → 1615335
Attachment #9122023 - Attachment is obsolete: true

(In reply to Valentin Gosu [:valentin] (he/him) from comment #23)

By all indications it seems bug 1615335 has fixed this issue by moving TRR requests to a background thread!
We can see the telemetry for going down (then slightly up for a few days when we turned off the feature) to 0. My local telemetry shows all requests completed with less than 1ms (ie 0ms) delay, which is what we want to see.

Most of the implementation was done by Kershaw, so I want to acknowledge the awesome work he did.

Firefox Private Network extension (FPN) turns off network.trr.fetch_off_main_thread, i.e. bug 1615335. [1][2]
Does it mean that FPN users are still potentially affected by this bug? Thanks.

[1]: bug 1620824 comment 17
[2]: Github issue #856 and #858

(In reply to Fanolian from comment #24)

Firefox Private Network extension (FPN) turns off network.trr.fetch_off_main_thread, i.e. bug 1615335. [1][2]
Does it mean that FPN users are still potentially affected by this bug? Thanks.

Very few of the requests when using a proxy actually need to do their own DNS resolution ( I think it may only be WebRTC )
The majority of DNS resolutions are performed by the proxy itself.

See Also: → 1644478
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: