Closed Bug 1737198 Opened 3 years ago Closed 3 years ago

[DoH] In strict fallback mode, retry DoH lookups upon network failures

Categories

(Core :: Networking: DNS, enhancement, P1)

enhancement

Tracking

()

RESOLVED FIXED
96 Branch
Tracking Status
firefox96 --- fixed

People

(Reporter: nhnt11, Assigned: nhnt11)

References

(Blocks 1 open bug)

Details

(Whiteboard: [necko-triaged])

Attachments

(9 files, 1 obsolete file)

48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
3.38 KB, text/plain
chutten
: data-review+
Details
48 bytes, text/x-phabricator-request
Details | Review

In mode 2.5, when a DoH lookup fails to resolve an address due to any reason other than:

  1. NXDOMAIN returned from the DoH provider (TRRSkippedReason::TRR_NXDOMAIN)
  2. We skipped TRR because previously resolved addresses could not be connected to (TRRSkippedReason::TRR_DISABLED_FLAG)
  3. Confirmation state is CONFIRM_FAILED or CONFIRM_TRYING_FAILED (TRRSkippedReason::TRR_NOT_CONFIRMED)
    we should retry the DoH lookup and make sure we use a fresh connection to do so.

This patch exposes a flag that tells a TRR to set the LOAD_FRESH_CONNECTION
load flag when creating the channel for the TRR lookup. This when seen by
TRRServiceChannel will cause us to close any existing DoH connections before
connecting. After a failed TRR lookup, nsHostResolver will retry with this
flag set, if we are in strict fallback mode and the skip reason was not one
of NXDOMAIN, DISABLED_FLAG, and NOT_CONFIRMED.

TODO:

  • Haven't really looked at ODoH at all yet
  • How should confirmation change now that we're retrying?
  • If we kill the DoH connection, other ongoing lookups will be killed too.
    Need to confirm that this will at least eventually lead to a retry, and
    that this is appropriate (or what we should do instead if not) and document
    how the skip reason telemetry will look when this happens.
  • Tests
  • ???
Attachment #9247209 - Attachment description: Bug 1737198 - [WIP] [DoH] In strict fallback mode, retry DoH lookups upon network failures. r=#necko → Bug 1737198 - Part 1: Cycle DoH connection and retry upon lookup failure. r=#necko

This adds a feature to moz-http2's doh path that helps test connection
cycling. We log the remote ports of DoH requests and expose an API to
fetch the log. A specific name will trigger us to send a delayed response
to help simulate network volatility. The log is then checked for correctness
in the test script.

TODO: add assertions for the log. this is easy, just waiting to figure out
one thing: at the moment when we cycle the connection for TRR, the retry
lookup queries as well as the confirmation query all use different connections.
I think this is because at the time of their dispatch there's no cached
connection so we end up using new connections for them... for subsequent
lookups we re-use the connection that was created for Confirmation.

This is evident by looking at the log of ports seen by the server when
we dump it at the end.

Depends on D129227

Attachment #9251301 - Attachment description: Bug 1737198 - [WIP] Part 2: Test connection cycling. r=#necko,valentin! → Bug 1737198 - Part 2: Test connection cycling. r=#necko,valentin!
Depends on: 1742743

Reverts bug 1651672 / rev D86520. It's been a few releases,
we can remove this cleanup code.

Depends on D131467

Attachment #9252425 - Attachment description: Bug 1737198 - Part 5: Enable strict fallback mode by default in Nightly. r=#necko! → Bug 1737198 - Part 6: Enable strict fallback mode by default in Nightly. r=#necko!
Attached file data-review-Bug-1737198.md (obsolete) —

Data review request for telemetry probes added in Part 4 (D132109)

Attachment #9252437 - Flags: data-review?(chutten)

Comment on attachment 9252437 [details]
data-review-Bug-1737198.md

DATA COLLECTION REVIEW RESPONSE:

Is there or will there be documentation that describes the schema for the ultimate data set available publicly, complete and accurate?

Yes.

Is there a control mechanism that allows the user to turn the data collection on and off?

Yes. This collection is Telemetry so can be controlled through Firefox's Preferences.

If the request is for permanent data collection, is there someone who will monitor the data over time?

Yes, Nihanth Subramanya is responsible.

Using the category system of data types on the Mozilla wiki, what collection type of data do the requested measurements fall under?

Category 1, Technical.

Is the data collection request for default-on or default-off?

Default on for all channels.

Does the instrumentation include the addition of any new identifiers?

No.

Is the data collection covered by the existing Firefox privacy notice?

Yes.

Does the data collection use a third-party collection tool?

No.


Result: datareview+

Attachment #9252437 - Flags: data-review?(chutten) → data-review+
Attached file Data review request v2

Hey chutten, the previous telemetry impl was piggy backing some data onto an existing histogram. Now I've added a new histogram for it - same as the old skip reason histogram but will exist to isolate data that was collected when this strict mode feature is on.

The questions we're answering, and everything else, remain the same.

Attachment #9253065 - Flags: data-review?(chutten)

Comment on attachment 9253065 [details]
Data review request v2

DATA COLLECTION REVIEW RESPONSE:

Is there or will there be documentation that describes the schema for the ultimate data set available publicly, complete and accurate?

Yes.

Is there a control mechanism that allows the user to turn the data collection on and off?

Yes. This collection is Telemetry so can be controlled through Firefox's Preferences.

If the request is for permanent data collection, is there someone who will monitor the data over time?

Yes, Nihanth Subramanya is responsible.

Using the category system of data types on the Mozilla wiki, what collection type of data do the requested measurements fall under?

Category 1, Technical.

Is the data collection request for default-on or default-off?

Default on for all channels.

Does the instrumentation include the addition of any new identifiers?

No.

Is the data collection covered by the existing Firefox privacy notice?

Yes.

Does the data collection use a third-party collection tool?

No.


Result: datareview+

Attachment #9253065 - Flags: data-review?(chutten) → data-review+
Pushed by nhnt11@gmail.com: https://hg.mozilla.org/integration/autoland/rev/45b749688e6f Part 1: Cycle DoH connection and retry upon lookup failure. r=necko-reviewers,valentin,dragana https://hg.mozilla.org/integration/autoland/rev/17276ab6d127 Part 2: Test connection cycling. r=valentin,necko-reviewers https://hg.mozilla.org/integration/autoland/rev/480e925ab804 Part 3: Remove TRR blocklist cleanup code. r=valentin,necko-reviewers https://hg.mozilla.org/integration/autoland/rev/956ddc990493 Part 4: Add telemetry probes. r=necko-reviewers,valentin https://hg.mozilla.org/integration/autoland/rev/ac08b423bc60 Part 5: Run some existing TRR tests in strict fallback mode. r=necko-reviewers,valentin https://hg.mozilla.org/integration/autoland/rev/595e81cc111c Part 6: Enable strict fallback mode by default in Nightly. r=necko-reviewers,valentin https://hg.mozilla.org/integration/autoland/rev/0b7bc91852a5 Part 7: Update source docs to mention strict fallback mode. r=necko-reviewers,valentin

Backed out 8 changesets (Bug 1743122, Bug 1737198) for causing build bustages on nsHostRecord.cpp.
Backout link
Push with failures
Failure Log
Also xpcshell - X1 Failure Log

Flags: needinfo?(nhnt11)
Pushed by nhnt11@gmail.com: https://hg.mozilla.org/integration/autoland/rev/c3b726205642 Part 1: Cycle DoH connection and retry upon lookup failure. r=necko-reviewers,valentin,dragana https://hg.mozilla.org/integration/autoland/rev/2ebd361d8512 Part 2: Test connection cycling. r=valentin,necko-reviewers https://hg.mozilla.org/integration/autoland/rev/827456b1b2e8 Part 3: Remove TRR blocklist cleanup code. r=valentin,necko-reviewers https://hg.mozilla.org/integration/autoland/rev/ae860da29fd4 Part 4: Add telemetry probes. r=necko-reviewers,valentin https://hg.mozilla.org/integration/autoland/rev/f7eeff0744f4 Part 5: Run some existing TRR tests in strict fallback mode. r=necko-reviewers,valentin https://hg.mozilla.org/integration/autoland/rev/84a05e5b5e8d Part 6: Enable strict fallback mode by default in Nightly. r=necko-reviewers,valentin https://hg.mozilla.org/integration/autoland/rev/f370d8070271 Part 7: Update source docs to mention strict fallback mode. r=necko-reviewers,valentin

Backed out 8 changesets (Bug 1737198, Bug 1743122) for causing xpcshell failures in unit/test_trr.js
Backout link: https://hg.mozilla.org/integration/autoland/rev/dae06634f2bfbfd75cc653817122978c85c84319
Push with failures, failure log.
(Update): Also caused bustages in nsHostResolver.

Pushed by nhnt11@gmail.com: https://hg.mozilla.org/integration/autoland/rev/960ecb376a56 Part 1: Cycle DoH connection and retry upon lookup failure. r=necko-reviewers,valentin,dragana https://hg.mozilla.org/integration/autoland/rev/6f1e88c3daf3 Part 2: Test connection cycling. r=valentin,necko-reviewers https://hg.mozilla.org/integration/autoland/rev/b703cf81d197 Part 3: Remove TRR blocklist cleanup code. r=valentin,necko-reviewers https://hg.mozilla.org/integration/autoland/rev/04bae7f14cec Part 4: Add telemetry probes. r=necko-reviewers,valentin https://hg.mozilla.org/integration/autoland/rev/05e67f464ee1 Part 5: Run some existing TRR tests in strict fallback mode. r=necko-reviewers,valentin https://hg.mozilla.org/integration/autoland/rev/7c257276a971 Part 6: Enable strict fallback mode by default in Nightly. r=necko-reviewers,valentin https://hg.mozilla.org/integration/autoland/rev/c43e8d579121 Part 7: Update source docs to mention strict fallback mode. r=necko-reviewers,valentin https://hg.mozilla.org/integration/autoland/rev/5d68e2b664cc Part 8: Expose Confirmation State for testing. r=kershaw,necko-reviewers
Pushed by nhnt11@gmail.com: https://hg.mozilla.org/integration/autoland/rev/04a75a837de9 Part 1: Cycle DoH connection and retry upon lookup failure. r=necko-reviewers,valentin,dragana https://hg.mozilla.org/integration/autoland/rev/e981d9e9c11c Part 2: Test connection cycling. r=valentin,necko-reviewers https://hg.mozilla.org/integration/autoland/rev/cdc9de540f2f Part 3: Remove TRR blocklist cleanup code. r=valentin,necko-reviewers https://hg.mozilla.org/integration/autoland/rev/3bab6f5b5b6b Part 4: Add telemetry probes. r=necko-reviewers,valentin https://hg.mozilla.org/integration/autoland/rev/189ed6145945 Part 5: Run some existing TRR tests in strict fallback mode. r=necko-reviewers,valentin https://hg.mozilla.org/integration/autoland/rev/546b0f714a4d Part 6: Enable strict fallback mode by default in Nightly. r=necko-reviewers,valentin https://hg.mozilla.org/integration/autoland/rev/9e4b52c06e3c Part 7: Update source docs to mention strict fallback mode. r=necko-reviewers,valentin https://hg.mozilla.org/integration/autoland/rev/d742fcfb8fad Part 8: Expose Confirmation State for testing. r=kershaw,necko-reviewers

This failed to land again, this time because I forgot to check that we're actually in the socket process before trying to mirror the confirmation state. We agreed to back out. In the meantime, here is a try push with a fix, let's make sure this actually passes everything before attempting to land again...

https://treeherder.mozilla.org/jobs?repo=try&revision=dfac5e97256df9ac57079950f4559f7c9ca68b2e

Flags: needinfo?(nhnt11)

Try's green, landing this again. 🤞

Pushed by nhnt11@gmail.com: https://hg.mozilla.org/integration/autoland/rev/bc5a65d8d866 Part 1: Cycle DoH connection and retry upon lookup failure. r=necko-reviewers,valentin,dragana https://hg.mozilla.org/integration/autoland/rev/c016a5438637 Part 2: Test connection cycling. r=valentin,necko-reviewers https://hg.mozilla.org/integration/autoland/rev/24db7713515a Part 3: Remove TRR blocklist cleanup code. r=valentin,necko-reviewers https://hg.mozilla.org/integration/autoland/rev/49b9db2831ac Part 4: Add telemetry probes. r=necko-reviewers,valentin https://hg.mozilla.org/integration/autoland/rev/4123e58dbff2 Part 5: Run some existing TRR tests in strict fallback mode. r=necko-reviewers,valentin https://hg.mozilla.org/integration/autoland/rev/6d1884416d1d Part 6: Enable strict fallback mode by default in Nightly. r=necko-reviewers,valentin https://hg.mozilla.org/integration/autoland/rev/fdad127a1d21 Part 7: Update source docs to mention strict fallback mode. r=necko-reviewers,valentin https://hg.mozilla.org/integration/autoland/rev/b2126f5ed075 Part 8: Expose Confirmation State for testing. r=kershaw,necko-reviewers
Depends on: 1744247
Attachment #9252437 - Attachment is obsolete: true
Depends on: 1745870
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: