Closed Bug 1608114 Opened 4 years ago Closed 4 years ago

Delay of ~20ms between receiving a DoH response and sending a SYN packet

Tracking

()

Status:

RESOLVED WORKSFORME

Milestone:

mozilla74

Tracking Flags:

Tracking

Status

firefox74

---

wontfix

People

(Reporter: mayhemer, Assigned: valentin)

References

Details

(Whiteboard: [necko-triaged][trr])

Attachments

(4 files, 1 obsolete file)

Bug 1608114 - Diagnose ~20ms delay processing a DoH response r=dragana 4 years ago Valentin Gosu [:valentin] (he/him) 47 bytes, text/x-phabricator-request		Details \| Review
data-request.md 4 years ago Valentin Gosu [:valentin] (he/him) 2.73 KB, text/plain	tdsmith : data-review+	Details
Bug 1608114 - Add dedicated thread for processing TRR responses r=dragana 4 years ago Valentin Gosu [:valentin] (he/him) 47 bytes, text/x-phabricator-request		Details \| Review
Bug 1608114 - Increase the priority of the TRR main thread events r=mayhemer 4 years ago Valentin Gosu [:valentin] (he/him) 47 bytes, text/x-phabricator-request		Details \| Review
Screenshot_2020-03-10 Evolution Dashboard.png 4 years ago Valentin Gosu [:valentin] (he/him) 110.01 KB, image/png		Details

Honza Bambas (:mayhemer)

Reporter

Description

•

4 years ago

Reported to me on slack indirectly by Bas:

'I've seen this demonstrated by looking at dev tools during a DoH based HTTPS connection and correlating it with a packet trace decoded with the exported ephemeral key. Let's assume the dev tools report N msec of DNS resolution time. If you look at the traces there is indeed roughly N msec in between when the DoH request is sent and when the TCP connection to the origin (using the DNS response) is made.. However, looking at the capture, the DoH response is received significantly before the TCP connection is started.. when using legacy DNS there is not a significant gap between the DNS response and the initiation of the TCP to the origin. I'm sorry I' don't have better data, but anecdotally this delay was often in the 20+ms range.'

So, it looks like it either takes more time to process the DoH response (because we need to go to the main thread?) or there is even more complex bug (not likely IMO).

Our tooling is insufficient to look at this easily, so I'm filing a bug to log this.

Nhi Nguyen (:nhi)

Updated

•

4 years ago

Assignee: nobody → valentin.gosu

Priority: P2 → P1

Whiteboard: [necko-triaged] → [necko-triaged][trr]

Valentin Gosu [:valentin] (he/him)

Assignee

Comment 1

•

4 years ago

Attached file Bug 1608114 - Diagnose ~20ms delay processing a DoH response r=dragana — Details

Enables timing for TRR channels and records telemetry with time elapsed
between the responseEnd and when the listener is notified with the
nsHostRecord object

Valentin Gosu [:valentin] (he/him)

Assignee

Comment 2

•

4 years ago

So, using the attached patch I ran a try build on a clean profile with TRR.
With very low activity (just opening a new tab), you rarely see more than 15-20ms delay.
But my run - opening fast.com in one tab, then opening lots of new pages at the same time gives us the results below.

I have a profile here: https://perfht.ml/3a7cHSY

DNS_TRR_PROCESSING_TIME
164 samples, average = 6.6, sum = 1,083

  0 |#########################  123  75%
  1 |##  11  7%
  2 |  2  1%
  3 |#  4  2%
  4 |  1  1%
  5 |#  3  2%
  9 |  1  1%
 12 |  1  1%
 20 |  1  1%
 23 |##  9  5%
 33 |  1  1%
 37 |  1  1%
 84 |  1  1%
120 |#  5  3%
135 |  0  0%

Valentin Gosu [:valentin] (he/him)

Assignee

Comment 3

•

4 years ago

Attached file data-request.md — Details

Attachment #9120729 - Flags: data-review?(chutten)

Tim Smith (inactive) 👨‍🔬 [:tdsmith]

Comment 4

•

4 years ago

Comment on attachment 9120729 [details]
data-request.md

I was following this bug so I'll go ahead and take the data-steward review.

This is approved as-is, but please consider whether there's really a need for an indefinite collection; if we identify the root cause of the timing issue, will we continue to monitor this probe after we fix it? A renewable time-limited collection could be a good alternative.

--

1) Is there or will there be **documentation** that describes the schema for the ultimate data set in a public, complete, and accurate way?

Yes, in Histograms.json and the probe dictionary.

2) Is there a control mechanism that allows the user to turn the data collection on and off?

Yes, Firefox users may [opt out of telemetry](https://support.mozilla.org/en-US/kb/share-data-mozilla-help-improve-firefox).

3) If the request is for permanent data collection, is there someone who will monitor the data over time?

Yes, vgosu.

4) Using the **[category system of data types](https://wiki.mozilla.org/Firefox/Data_Collection)** on the Mozilla wiki, what collection type of data do the requested measurements fall under?

Category 1, technical data.

5) Is the data collection request for default-on or default-off?

Default-on.

6) Does the instrumentation include the addition of **any *new* identifiers**?

No.

7) Is the data collection covered by the existing Firefox privacy notice?

Yes.

8) Does there need to be a check-in in the future to determine whether to renew the data?

n/a; permanent collection.

9) Does the data collection use a third-party collection tool?

No.

Attachment #9120729 - Flags: data-review?(chutten) → data-review+

Valentin Gosu [:valentin] (he/him)

Assignee

Comment 5

•

4 years ago

(In reply to Tim Smith 👨‍🔬 [:tdsmith] from comment #4)

Comment on attachment 9120729 [details]
data-request.md

I was following this bug so I'll go ahead and take the data-steward review.

Thanks!

This is approved as-is, but please consider whether there's really a need
for an indefinite collection; if we identify the root cause of the timing
issue, will we continue to monitor this probe after we fix it? A renewable
time-limited collection could be a good alternative.

That depends on the fix. If we'll have a fix that ensures a small processing delay and is not affected by the browser's workload, then we can switch to a time limited probe (even though it might be worth keeping it to ensure we don't regress in the future).
However, if our fix for this is running on a thread that also processes other things, we probably want to keep it as-is.

Pulsebot

Comment 6

•

4 years ago

Pushed by valentin.gosu@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/8963f5a745f0
Diagnose ~20ms delay processing a DoH response r=dragana

Cristian Brindusan [:cbrindusan]

Comment 7

•

4 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/8963f5a745f0

Status: NEW → RESOLVED

Closed: 4 years ago

status-firefox74: --- → fixed

Resolution: --- → FIXED

Target Milestone: --- → mozilla74

Honza Bambas (:mayhemer)

Reporter

Comment 8

•

4 years ago

We have landed only a diagnostic patch, not an actual fix.

Status: RESOLVED → REOPENED

status-firefox74: fixed → affected

Keywords: leave-open

Resolution: FIXED → ---

Honza Bambas (:mayhemer)

Reporter

Updated

•

4 years ago

Status: REOPENED → ASSIGNED

Valentin Gosu [:valentin] (he/him)

Assignee

Comment 9

•

4 years ago

Attached file Bug 1608114 - Add dedicated thread for processing TRR responses r=dragana (obsolete) — Details

This is a speculative fix for the TRR delay.
It retargets delivery of OnDataAvailable to a dedicated TRR thread and it
attempts to process the response in the last OnDataAvailable callback.
This should avoid the busier main thread and make sure we process the
response a bit sooner.

Valentin Gosu [:valentin] (he/him)

Assignee

Comment 10

•

4 years ago

After discussing this with Dragana it was clear why this patch doesn't completely fix the problem: even if the socket has received all the data, it still needs to send OnStartRequest which happens on the main thread - so we're still waiting for the main thread events to be scheduled.
The patch offers only a marginal improvement in moving onDataAvailable and the response processing off the main thread. It seems we need something like bug 1528285 to avoid this issue completely.

Honza Bambas (:mayhemer)

Reporter

Comment 11

•

4 years ago

(In reply to Valentin Gosu [:valentin] (he/him) from comment #10)

After discussing this with Dragana it was clear why this patch doesn't completely fix the problem: even if the socket has received all the data, it still needs to send OnStartRequest which happens on the main thread - so we're still waiting for the main thread events to be scheduled.
The patch offers only a marginal improvement in moving onDataAvailable and the response processing off the main thread. It seems we need something like bug 1528285 to avoid this issue completely.

Yeah, I was afraid of that. Also, there is the proxy resolution callback. What about setting the priorities? Have you considered that or is it not available on the parent process?

Honza Bambas (:mayhemer)

Reporter

Updated

•

4 years ago

Comment 12

•

4 years ago

Attached file Bug 1608114 - Increase the priority of the TRR main thread events r=mayhemer — Details

Randell Jesup [:jesup] (needinfo me)

Comment 13

•

4 years ago

Smaug, bas: is HIGH the right priority to use here? As opposed to INPUT_HIGH or MEDIUM_HIGH. I suspect this runs for only a short time, which helps with the idea of making it HIGH. Valentin, how long could this event run/block other events? << 1ms?

Flags: needinfo?(valentin.gosu)

Flags: needinfo?(bugs)

Flags: needinfo?(bas)

Olli Pettay [:smaug][bugs@pettay.fi]

Comment 14

•

4 years ago

mediumhigh is the right one here.

high is really for vsync and has some special magic around it (like, it let's one non-high to be process per each high), and input_* is really for input events only.
we should rename high to vsync I think.

Flags: needinfo?(bugs)

Olli Pettay [:smaug][bugs@pettay.fi]

Comment 15

•

4 years ago

(And what has been discussed occasionally, and what would be useful here I think, is some "control queue" for passing some state or other simple thing through main thread. Such queue should have the highest possible priority always. But we don't have that atm)

Valentin Gosu [:valentin] (he/him)

Assignee

Comment 16

•

4 years ago

(In reply to Randell Jesup [:jesup] (needinfo me) from comment #13)

Valentin, how long could this event run/block other events? << 1ms?

Definitely less than 1ms. The TRR onStart/data/stop don't do very much at all. But I'll make them mediumhigh anyway.

Flags: needinfo?(valentin.gosu)

Pulsebot

Comment 17

•

4 years ago

Pushed by valentin.gosu@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/0c429fa8285b
Increase the priority of the TRR main thread events r=mayhemer

Noemi Erli[:noemi_erli]

Comment 18

•

4 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/0c429fa8285b

Valentin Gosu [:valentin] (he/him)

Assignee

Comment 19

•

4 years ago

The last patch seems to have made things a little better on my machine, but it's difficult to quantify it.
It increases the priority of events being dispatched to the main thread.
I plan to also land the remaining patch - which moves OnDataAvailable and the response processing off the main thread.
Andrew, could check if you see any improvements compared to prior builds?
Thanks!

Flags: needinfo?(acreskey)

Tim Smith (inactive) 👨‍🔬 [:tdsmith]

Comment 20

•

4 years ago

It looks like that patch landed in 20200131213333. Looking at https://colab.research.google.com/drive/1ksUeWjz9Fx7xegoKEHuShcvzbvyviTRn I don't see a convincing change in any of

the fraction of dns_trr_processing_time measurements greater than zero
the median user's geometric mean of dns_trr_processing_time measurements greater than zero
the median user's arithmetic mean of all dns_trr_processing_time measurements

comparing the before and after builds.

Andrew Creskey [:acreskey]

Comment 21

•

4 years ago

Valentin, as far as I can see, none of the builds of Fenix or Geckoview_example have DoH enabled. (I'm looking at the pref, network.trr.mode, which is set to 0)

Flags: needinfo?(acreskey)

Valentin Gosu [:valentin] (he/him)

Assignee

Comment 22

•

4 years ago

(In reply to Andrew Creskey [:acreskey] [he/him] from comment #21)

Valentin, as far as I can see, none of the builds of Fenix or Geckoview_example have DoH enabled. (I'm looking at the pref, network.trr.mode, which is set to 0)

You're right, we don't enable the feature for mobile. Honza is working on a similar approach in bug 1612118 which should also apply to mobile.
Unfortunately, from Tim's analysis in comment 20 it seems this didn't really fix the root cause.

Valentin Gosu [:valentin] (he/him)

Assignee

Comment 23

•

4 years ago

Attached image Screenshot_2020-03-10 Evolution Dashboard.png — Details

By all indications it seems bug 1615335 has fixed this issue by moving TRR requests to a background thread!
We can see the telemetry for going down (then slightly up for a few days when we turned off the feature) to 0. My local telemetry shows all requests completed with less than 1ms (ie 0ms) delay, which is what we want to see.

Most of the implementation was done by Kershaw, so I want to acknowledge the awesome work he did.

Flags: needinfo?(bas)

Valentin Gosu [:valentin] (he/him)

Assignee

Updated

•

4 years ago

Status: ASSIGNED → RESOLVED

Closed: 4 years ago → 4 years ago

Keywords: leave-open

Resolution: --- → WORKSFORME

Updated

•

4 years ago

Attachment #9122023 - Attachment is obsolete: true

Fanolian

Comment 24

•

4 years ago

(In reply to Valentin Gosu [:valentin] (he/him) from comment #23)

By all indications it seems bug 1615335 has fixed this issue by moving TRR requests to a background thread!
We can see the telemetry for going down (then slightly up for a few days when we turned off the feature) to 0. My local telemetry shows all requests completed with less than 1ms (ie 0ms) delay, which is what we want to see.

Most of the implementation was done by Kershaw, so I want to acknowledge the awesome work he did.

Firefox Private Network extension (FPN) turns off network.trr.fetch_off_main_thread, i.e. bug 1615335. [1][2]
Does it mean that FPN users are still potentially affected by this bug? Thanks.

[1]: bug 1620824 comment 17
[2]: Github issue #856 and #858

Valentin Gosu [:valentin] (he/him)

Assignee

Comment 25

•

4 years ago

(In reply to Fanolian from comment #24)

Firefox Private Network extension (FPN) turns off network.trr.fetch_off_main_thread, i.e. bug 1615335. [1][2]
Does it mean that FPN users are still potentially affected by this bug? Thanks.

Very few of the requests when using a proxy actually need to do their own DNS resolution ( I think it may only be WebRTC )
The majority of DNS resolutions are performed by the proxy itself.

Ryan VanderMeulen [:RyanVM]

Updated

•

4 years ago

status-firefox74: affected → wontfix

Valentin Gosu [:valentin] (he/him)

Assignee

Updated

•

4 years ago