Closed Bug 1474159 Opened 6 years ago Closed 6 years ago

Windows clone times regressed significantly in late April 2018 (with hg upgrade?)

Categories

(Infrastructure & Operations :: RelOps: OpenCloudConfig, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: Callek, Assigned: grenade)

Details

(Keywords: regression)

Attachments

(3 files)

Hey, so I'm looking at per-locale nightly-l10n timing, and it is pretty clear that we have a regression on windows only (see attached graph screenshots)

The mercurial range for the regression is https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=e33e5049d9f35e16893f5ba4ec823f69f04972dd&tochange=8d85e4da25463ccdcf902ac5a846773b8b642553

While two randomly selected tasks from that range are:
https://tools.taskcluster.net/groups/BlO0GndDTK-j0b5LSKQj3Q/tasks/RbVoTK3LSy6ekwvi9onlIA/runs/0/artifacts
https://tools.taskcluster.net/groups/TYX1YnVySe2p1xxNi2OM-w/tasks/QaR5Mpw8Sk-C9l5Yt32Jsg/details

If you see those logs you'll see the OpenCloudConfig csets, which gives us additional regression range of https://github.com/mozilla-releng/OpenCloudConfig/compare/3e0ff7feb6513c98074ef272425b2173c219e771...8a69ab18952510bf2569b0549ffd78712426b4be

Looking at both ranges the only likely culprit is the Mercurial upgrade, and indeed in the pre-regression task clone time is about 5 minutes, and post-regression about 20 minutes!
p.s. Ignore the "per task" in the Y column of those plots, these *are* really per-locale
Flags: needinfo?(gps)
First, it's worth noting exactly what VCS operation is slow: there's a difference between:

a) initial clone of repository to local disk
b) fetching new data from server to local disk
c) updating the working directory to contain filed tracked under version control

Fortunately, modern versions of `hg robustcheckout` differentiate these operations and record metrics for how long each takes. e.g.

PERFHERDER_DATA: {"framework": {"name": "vcs"}, "suites": [{"extraOptions": ["c4.4xlarge"], "lowerIsBetter": true, "name": "clone", "shouldAlert": false, "subtests": [], "value": 282.5940001010895}, {"extraOptions": ["c4.4xlarge"], "lowerIsBetter": true, "name": "pull", "shouldAlert": false, "subtests": [], "value": 9.003999948501587}, {"extraOptions": ["c4.4xlarge"], "lowerIsBetter": true, "name": "update", "shouldAlert": false, "subtests": [], "value": 873.2530000209808}, {"extraOptions": ["c4.4xlarge"], "lowerIsBetter": true, "name": "overall", "shouldAlert": false, "subtests": [], "value": 1166.6189999580383}]}

The "cloning" bit in this example takes 282.59s, or 4:42s. The "update" bit takes 873.25s, or 14:33s.

Both those times are pretty slow (compared to Linux). And both are much slower than what I'm able to achieve on my local Windows machine. I've long suspected our Windows EC2 instances have very sub-optimal I/O. But I haven't had time to investigate it further. But that's for another bug, as this one is about a regression.

While I haven't tested, I suspect the regression is caused by both the Mercurial upgrade *and* the disabling of Windows Defender in https://github.com/mozilla-releng/OpenCloudConfig/commit/4f1899294022a1ff944e62d209d4445d08a09463.

Mercurial uses a thread pool for closing file handles on Windows to work around Windows Defender scanning files on close. When Windows Defender is running, this can shave *minutes* off of operations like `hg clone` and `hg update` that write hundreds of thousands of files. However, when Windows Defender isn't running, this thread pool is overhead and slows down operations.

Setting worker.backgroundclose=false in the hgrc (or on the command line via `--config worker.backgroundclose=false` will disable this feature and will likely make operations faster now that Windows Defender is disabled (4f1899294022a1ff944e62d209d4445d08a09463).

The Mercurial 4.5 upgrade likely regressed performance because Mercurial uses a thread pool for working directory updates in 4.5: https://www.mercurial-scm.org/repo/hg/rev/02b36e860e0b. In theory this should make `hg update` faster. However, because it is a thread pool and Python's multithreading is horrible (because the GIL) and because the c4.4xlarge instances have 16 CPUs, we're likely over-saturating with too many threads and the context switching is likely grinding Python to a crawl.

We may want to pin the number of workers/threads on Windows to a reasonable ceiling, which we derive from measuring what works best. We can do this by setting the worker.numcpus config option. My experience tells me we won't see much of a speedup after 2 or 3 threads. But the I/O on these EC2 instances is wonky. So my experience with Windows I/O and Python performance may not translate well. We should definitely measure.

I have Try pushes at:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=ebac746ac0d2b0e6d8f641b7f6e50c6afde7a48d (no background close)
https://treeherder.mozilla.org/#/jobs?repo=try&revision=4d0e65791ca031894f9bab2c74c34629bdf46499 (no background close / no worker)
https://treeherder.mozilla.org/#/jobs?repo=try&revision=25c07afca721eeb39563897bfb0cad1ccf5ff283 (no background close / 2 threads)
https://treeherder.mozilla.org/#/jobs?repo=try&revision=d0eabaf7ed1da000325cc5d40806d2a837d60ee6 (no worker)
https://treeherder.mozilla.org/#/jobs?repo=try&revision=dc1d052f98f4e7827780acc6279a389b4d873d5c (2 threads)
https://treeherder.mozilla.org/#/jobs?repo=try&revision=fa1419bae49017e148021c95226bc8d23a36de9b (3 threads)
https://treeherder.mozilla.org/#/jobs?repo=try&revision=494e184a12b5ea30721d145c8353f354a4724ef3 (4 threads)

I've already seen one task with `hg update` times reduced by >2 minutes, so things are looking promising.
Flags: needinfo?(gps)
set worker.backgroundclose=false in global hg config
Attachment #8990934 - Flags: review?(gps)
(In reply to Gregory Szorc [:gps] from comment #3)
> I have Try pushes at:

I created a spreadsheet from manually inspected data on these pushes, (because the PERFHERDER data isn't injested so I couldn't use an auto-created query)

https://docs.google.com/spreadsheets/d/1YIQSA2LlkL7NEMsvTQdxz3j9YwsiFrZG8vd5Nnrzp_I/edit#gid=0

I have given :gps access to edit, but it seems clear to me that no-worker was the best variant.
Comment on attachment 8990934 [details] [review]
https://github.com/mozilla-releng/OpenCloudConfig/pull/161

This has been reviewed on GitHub for a few days.
Attachment #8990934 - Flags: review?(gps) → review+
I submitted a patch upstream to disable threaded workers for working directory operations: https://phab.mercurial-scm.org/D3963

Hopefully that gets approved (in some form) and lands in 4.7.
(In reply to Gregory Szorc [:gps] from comment #6)
> Comment on attachment 8990934 [details] [review]
> https://github.com/mozilla-releng/OpenCloudConfig/pull/161
> 
> This has been reviewed on GitHub for a few days.

thank you! now merged.
Rob, can you confirm when this went live on the workers? (I'm not sure the deployment story with OCC).
Assignee: nobody → rthijssen
(In reply to Justin Wood (:Callek) from comment #9)
> Rob, can you confirm when this went live on the workers? (I'm not sure the
> deployment story with OCC).

This went live with the merge on 2018-07-18. Running instances would have downloaded the new hg configuration file as soon as the change was observed (between tasks).
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: