Delay of ~20ms between receiving a DoH response and sending a SYN packet
Categories
(Core :: Networking, defect, P1)
Tracking

| Version | Status |
|---|---|
| firefox74 | wontfix |
People
(Reporter: mayhemer, Assigned: valentin)
References
Details
(Whiteboard: [necko-triaged][trr])
Attachments
(4 files, 1 obsolete file)
Reported to me on Slack indirectly by Bas:

"I've seen this demonstrated by looking at dev tools during a DoH-based HTTPS connection and correlating it with a packet trace decoded with the exported ephemeral key. Let's assume the dev tools report N ms of DNS resolution time. If you look at the traces, there is indeed roughly N ms between when the DoH request is sent and when the TCP connection to the origin (using the DNS response) is made. However, looking at the capture, the DoH response is received significantly before the TCP connection is started. When using legacy DNS there is no significant gap between the DNS response and the initiation of the TCP connection to the origin. I'm sorry I don't have better data, but anecdotally this delay was often in the 20+ ms range."

So it looks like it either takes more time to process the DoH response (because we need to go to the main thread?) or there is an even more complex bug (not likely, IMO).
Our tooling is insufficient to look at this easily, so I'm filing a bug to log this.
Updated • 4 years ago

Comment 1 (Assignee) • 4 years ago
- Enables timing for TRR channels and records telemetry with time elapsed
between the responseEnd and when the listener is notified with the
nsHostRecord object
Comment 2 (Assignee) • 4 years ago
So, using the attached patch I ran a try build on a clean profile with TRR.
With very low activity (just opening a new tab), you rarely see more than 15-20ms delay.
But my run (opening fast.com in one tab, then opening lots of new pages at the same time) gives the results below.
I have a profile here: https://perfht.ml/3a7cHSY
DNS_TRR_PROCESSING_TIME
164 samples, average = 6.6, sum = 1,083
0 |######################### 123 75%
1 |## 11 7%
2 | 2 1%
3 |# 4 2%
4 | 1 1%
5 |# 3 2%
9 | 1 1%
12 | 1 1%
20 | 1 1%
23 |## 9 5%
33 | 1 1%
37 | 1 1%
84 | 1 1%
120 |# 5 3%
135 | 0 0%
Comment 3 (Assignee) • 4 years ago
Comment 4 • 4 years ago
Comment on attachment 9120729 [details] data-request.md

I was following this bug so I'll go ahead and take the data-steward review. This is approved as-is, but please consider whether there's really a need for an indefinite collection; if we identify the root cause of the timing issue, will we continue to monitor this probe after we fix it? A renewable time-limited collection could be a good alternative.

1) Is there or will there be **documentation** that describes the schema for the ultimate data set in a public, complete, and accurate way? Yes, in Histograms.json and the probe dictionary.
2) Is there a control mechanism that allows the user to turn the data collection on and off? Yes, Firefox users may [opt out of telemetry](https://support.mozilla.org/en-US/kb/share-data-mozilla-help-improve-firefox).
3) If the request is for permanent data collection, is there someone who will monitor the data over time? Yes, vgosu.
4) Using the **[category system of data types](https://wiki.mozilla.org/Firefox/Data_Collection)** on the Mozilla wiki, what collection type of data do the requested measurements fall under? Category 1, technical data.
5) Is the data collection request for default-on or default-off? Default-on.
6) Does the instrumentation include the addition of **any *new* identifiers**? No.
7) Is the data collection covered by the existing Firefox privacy notice? Yes.
8) Does there need to be a check-in in the future to determine whether to renew the data? n/a; permanent collection.
9) Does the data collection use a third-party collection tool? No.
Comment 5 (Assignee) • 4 years ago
(In reply to Tim Smith 👨🔬 [:tdsmith] from comment #4)
> Comment on attachment 9120729 [details] data-request.md
> I was following this bug so I'll go ahead and take the data-steward review.
Thanks!
Thanks!
> This is approved as-is, but please consider whether there's really a need
> for an indefinite collection; if we identify the root cause of the timing
> issue, will we continue to monitor this probe after we fix it? A renewable
> time-limited collection could be a good alternative.
That depends on the fix. If we have a fix that ensures a small processing delay and is not affected by the browser's workload, then we can switch to a time-limited probe (even though it might be worth keeping it to ensure we don't regress in the future).
However, if our fix for this is running on a thread that also processes other things, we probably want to keep it as-is.
Pushed by valentin.gosu@gmail.com: https://hg.mozilla.org/integration/autoland/rev/8963f5a745f0 Diagnose ~20ms delay processing a DoH response r=dragana
Comment 7 • 4 years ago

bugherder
Comment 8 (Reporter) • 4 years ago
We have landed only a diagnostic patch, not an actual fix.
Updated (Reporter) • 4 years ago

Comment 9 (Assignee) • 4 years ago
This is a speculative fix for the TRR delay.
It retargets delivery of OnDataAvailable to a dedicated TRR thread and it
attempts to process the response in the last OnDataAvailable callback.
This should avoid the often-busy main thread and ensure we process the
response a bit sooner.
Comment 10 (Assignee) • 4 years ago
After discussing this with Dragana it was clear why this patch doesn't completely fix the problem: even if the socket has received all the data, it still needs to send OnStartRequest which happens on the main thread - so we're still waiting for the main thread events to be scheduled.
The patch offers only a marginal improvement in moving onDataAvailable and the response processing off the main thread. It seems we need something like bug 1528285 to avoid this issue completely.
Comment 11 (Reporter) • 4 years ago
(In reply to Valentin Gosu [:valentin] (he/him) from comment #10)
> After discussing this with Dragana it was clear why this patch doesn't completely fix the problem: even if the socket has received all the data, it still needs to send OnStartRequest which happens on the main thread - so we're still waiting for the main thread events to be scheduled.
> The patch offers only a marginal improvement in moving onDataAvailable and the response processing off the main thread. It seems we need something like bug 1528285 to avoid this issue completely.
Yeah, I was afraid of that. Also, there is the proxy resolution callback. What about setting the priorities? Have you considered that, or is it not available in the parent process?
Comment 12 (Assignee) • 4 years ago
Comment 13 • 4 years ago
Smaug, bas: is HIGH the right priority to use here? As opposed to INPUT_HIGH or MEDIUM_HIGH. I suspect this runs for only a short time, which helps with the idea of making it HIGH. Valentin, how long could this event run/block other events? << 1ms?
Comment 14 • 4 years ago
mediumhigh is the right one here.
high is really for vsync and has some special magic around it (like, it lets one non-high event be processed per each high), and input_* is really for input events only.
we should rename high to vsync I think.
Comment 15 • 4 years ago
(And what has been discussed occasionally, and what would be useful here I think, is some "control queue" for passing some state or other simple thing through the main thread. Such a queue should always have the highest possible priority. But we don't have that at the moment.)
Comment 16 (Assignee) • 4 years ago
(In reply to Randell Jesup [:jesup] (needinfo me) from comment #13)
> Valentin, how long could this event run/block other events? << 1ms?
Definitely less than 1ms. The TRR onStart/data/stop don't do very much at all. But I'll make them mediumhigh anyway.
Comment 17 • 4 years ago
Pushed by valentin.gosu@gmail.com: https://hg.mozilla.org/integration/autoland/rev/0c429fa8285b Increase the priority of the TRR main thread events r=mayhemer
Comment 18 • 4 years ago

bugherder
Comment 19 (Assignee) • 4 years ago
The last patch seems to have made things a little better on my machine, but it's difficult to quantify it.
It increases the priority of events being dispatched to the main thread.
I plan to also land the remaining patch, which moves OnDataAvailable and the response processing off the main thread.
Andrew, could you check if you see any improvements compared to prior builds?
Thanks!
Comment 20 • 4 years ago
It looks like that patch landed in 20200131213333. Looking at https://colab.research.google.com/drive/1ksUeWjz9Fx7xegoKEHuShcvzbvyviTRn I don't see a convincing change in any of
- the fraction of dns_trr_processing_time measurements greater than zero
- the median user's geometric mean of dns_trr_processing_time measurements greater than zero
- the median user's arithmetic mean of all dns_trr_processing_time measurements
comparing the before and after builds.
Comment 21 • 4 years ago
Valentin, as far as I can see, none of the builds of Fenix or Geckoview_example have DoH enabled. (I'm looking at the pref, network.trr.mode, which is set to 0.)
Comment 22 (Assignee) • 4 years ago
(In reply to Andrew Creskey [:acreskey] [he/him] from comment #21)
> Valentin, as far as I can see, none of the builds of Fenix or Geckoview_example have DoH enabled. (I'm looking at the pref, network.trr.mode, which is set to 0.)
You're right, we don't enable the feature for mobile. Honza is working on a similar approach in bug 1612118 which should also apply to mobile.
Unfortunately, from Tim's analysis in comment 20 it seems this didn't really fix the root cause.
Comment 23 (Assignee) • 4 years ago
By all indications it seems bug 1615335 has fixed this issue by moving TRR requests to a background thread!
We can see the telemetry go down to 0 (with a slight uptick for a few days when we turned off the feature). My local telemetry shows all requests completed with less than 1 ms (i.e. 0 ms) delay, which is what we want to see.
Most of the implementation was done by Kershaw, so I want to acknowledge the awesome work he did.
Updated (Assignee) • 4 years ago
Updated • 4 years ago
Comment 24 • 4 years ago
(In reply to Valentin Gosu [:valentin] (he/him) from comment #23)
> By all indications it seems bug 1615335 has fixed this issue by moving TRR requests to a background thread!
> We can see the telemetry going down (then slightly up for a few days when we turned off the feature) to 0. My local telemetry shows all requests completed with less than 1ms (ie 0ms) delay, which is what we want to see. Most of the implementation was done by Kershaw, so I want to acknowledge the awesome work he did.
Firefox Private Network extension (FPN) turns off network.trr.fetch_off_main_thread, i.e. bug 1615335. [1][2]
Does it mean that FPN users are still potentially affected by this bug? Thanks.
[1]: bug 1620824 comment 17
[2]: Github issue #856 and #858
Comment 25 (Assignee) • 4 years ago
(In reply to Fanolian from comment #24)
> Firefox Private Network extension (FPN) turns off network.trr.fetch_off_main_thread, i.e. bug 1615335. [1][2]
> Does it mean that FPN users are still potentially affected by this bug? Thanks.
Very few of the requests when using a proxy actually need to do their own DNS resolution (I think it may only be WebRTC).
The majority of DNS resolutions are performed by the proxy itself.
Updated • 4 years ago