Closed
Bug 1474159
Opened 6 years ago
Closed 6 years ago
Windows clone times regressed significantly in late April 2018 (with hg upgrade?)
Categories
(Infrastructure & Operations :: RelOps: OpenCloudConfig, defect)
Infrastructure & Operations
RelOps: OpenCloudConfig
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: Callek, Assigned: grenade)
Details
(Keywords: regression)
Attachments
(3 files)
Hey, so I'm looking at per-locale nightly-l10n timing, and it is pretty clear that we have a regression on windows only (see attached graph screenshots) The mercurial range for the regression is https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=e33e5049d9f35e16893f5ba4ec823f69f04972dd&tochange=8d85e4da25463ccdcf902ac5a846773b8b642553 While two randomly selected tasks from that range are: https://tools.taskcluster.net/groups/BlO0GndDTK-j0b5LSKQj3Q/tasks/RbVoTK3LSy6ekwvi9onlIA/runs/0/artifacts https://tools.taskcluster.net/groups/TYX1YnVySe2p1xxNi2OM-w/tasks/QaR5Mpw8Sk-C9l5Yt32Jsg/details If you see those logs you'll see the OpenCloudConfig csets, which gives us additional regression range of https://github.com/mozilla-releng/OpenCloudConfig/compare/3e0ff7feb6513c98074ef272425b2173c219e771...8a69ab18952510bf2569b0549ffd78712426b4be Looking at both ranges the only likely culprit is the Mercurial upgrade, and indeed in the pre-regression task clone time is about 5 minutes, and post-regression about 20 minutes!
Reporter | ||
Comment 1•6 years ago
|
||
Reporter | ||
Comment 2•6 years ago
|
||
p.s. Ignore the "per task" in the Y column of those plots, these *are* really per-locale
Updated•6 years ago
|
Keywords: regression
Updated•6 years ago
|
Flags: needinfo?(gps)
Comment 3•6 years ago
|
||
First, it's worth noting exactly what VCS operation is slow: there's a difference between: a) initial clone of repository to local disk b) fetching new data from server to local disk c) updating the working directory to contain filed tracked under version control Fortunately, modern versions of `hg robustcheckout` differentiate these operations and record metrics for how long each takes. e.g. PERFHERDER_DATA: {"framework": {"name": "vcs"}, "suites": [{"extraOptions": ["c4.4xlarge"], "lowerIsBetter": true, "name": "clone", "shouldAlert": false, "subtests": [], "value": 282.5940001010895}, {"extraOptions": ["c4.4xlarge"], "lowerIsBetter": true, "name": "pull", "shouldAlert": false, "subtests": [], "value": 9.003999948501587}, {"extraOptions": ["c4.4xlarge"], "lowerIsBetter": true, "name": "update", "shouldAlert": false, "subtests": [], "value": 873.2530000209808}, {"extraOptions": ["c4.4xlarge"], "lowerIsBetter": true, "name": "overall", "shouldAlert": false, "subtests": [], "value": 1166.6189999580383}]} The "cloning" bit in this example takes 282.59s, or 4:42s. The "update" bit takes 873.25s, or 14:33s. Both those times are pretty slow (compared to Linux). And both are much slower than what I'm able to achieve on my local Windows machine. I've long suspected our Windows EC2 instances have very sub-optimal I/O. But I haven't had time to investigate it further. But that's for another bug, as this one is about a regression. While I haven't tested, I suspect the regression is caused by both the Mercurial upgrade *and* the disabling of Windows Defender in https://github.com/mozilla-releng/OpenCloudConfig/commit/4f1899294022a1ff944e62d209d4445d08a09463. Mercurial uses a thread pool for closing file handles on Windows to work around Windows Defender scanning files on close. When Windows Defender is running, this can shave *minutes* off of operations like `hg clone` and `hg update` that write hundreds of thousands of files. However, when Windows Defender isn't running, this thread pool is overhead and slows down operations. Setting worker.backgroundclose=false in the hgrc (or on the command line via `--config worker.backgroundclose=false` will disable this feature and will likely make operations faster now that Windows Defender is disabled (4f1899294022a1ff944e62d209d4445d08a09463). The Mercurial 4.5 upgrade likely regressed performance because Mercurial uses a thread pool for working directory updates in 4.5: https://www.mercurial-scm.org/repo/hg/rev/02b36e860e0b. In theory this should make `hg update` faster. However, because it is a thread pool and Python's multithreading is horrible (because the GIL) and because the c4.4xlarge instances have 16 CPUs, we're likely over-saturating with too many threads and the context switching is likely grinding Python to a crawl. We may want to pin the number of workers/threads on Windows to a reasonable ceiling, which we derive from measuring what works best. We can do this by setting the worker.numcpus config option. My experience tells me we won't see much of a speedup after 2 or 3 threads. But the I/O on these EC2 instances is wonky. So my experience with Windows I/O and Python performance may not translate well. We should definitely measure. I have Try pushes at: https://treeherder.mozilla.org/#/jobs?repo=try&revision=ebac746ac0d2b0e6d8f641b7f6e50c6afde7a48d (no background close) https://treeherder.mozilla.org/#/jobs?repo=try&revision=4d0e65791ca031894f9bab2c74c34629bdf46499 (no background close / no worker) https://treeherder.mozilla.org/#/jobs?repo=try&revision=25c07afca721eeb39563897bfb0cad1ccf5ff283 (no background close / 2 threads) https://treeherder.mozilla.org/#/jobs?repo=try&revision=d0eabaf7ed1da000325cc5d40806d2a837d60ee6 (no worker) https://treeherder.mozilla.org/#/jobs?repo=try&revision=dc1d052f98f4e7827780acc6279a389b4d873d5c (2 threads) https://treeherder.mozilla.org/#/jobs?repo=try&revision=fa1419bae49017e148021c95226bc8d23a36de9b (3 threads) https://treeherder.mozilla.org/#/jobs?repo=try&revision=494e184a12b5ea30721d145c8353f354a4724ef3 (4 threads) I've already seen one task with `hg update` times reduced by >2 minutes, so things are looking promising.
Flags: needinfo?(gps)
Assignee | ||
Comment 4•6 years ago
|
||
set worker.backgroundclose=false in global hg config
Attachment #8990934 -
Flags: review?(gps)
Reporter | ||
Comment 5•6 years ago
|
||
(In reply to Gregory Szorc [:gps] from comment #3) > I have Try pushes at: I created a spreadsheet from manually inspected data on these pushes, (because the PERFHERDER data isn't injested so I couldn't use an auto-created query) https://docs.google.com/spreadsheets/d/1YIQSA2LlkL7NEMsvTQdxz3j9YwsiFrZG8vd5Nnrzp_I/edit#gid=0 I have given :gps access to edit, but it seems clear to me that no-worker was the best variant.
Comment 6•6 years ago
|
||
Comment on attachment 8990934 [details] [review] https://github.com/mozilla-releng/OpenCloudConfig/pull/161 This has been reviewed on GitHub for a few days.
Attachment #8990934 -
Flags: review?(gps) → review+
Comment 7•6 years ago
|
||
I submitted a patch upstream to disable threaded workers for working directory operations: https://phab.mercurial-scm.org/D3963 Hopefully that gets approved (in some form) and lands in 4.7.
Assignee | ||
Comment 8•6 years ago
|
||
(In reply to Gregory Szorc [:gps] from comment #6) > Comment on attachment 8990934 [details] [review] > https://github.com/mozilla-releng/OpenCloudConfig/pull/161 > > This has been reviewed on GitHub for a few days. thank you! now merged.
Reporter | ||
Comment 9•6 years ago
|
||
Rob, can you confirm when this went live on the workers? (I'm not sure the deployment story with OCC).
Assignee: nobody → rthijssen
Assignee | ||
Comment 10•6 years ago
|
||
(In reply to Justin Wood (:Callek) from comment #9) > Rob, can you confirm when this went live on the workers? (I'm not sure the > deployment story with OCC). This went live with the merge on 2018-07-18. Running instances would have downloaded the new hg configuration file as soon as the change was observed (between tasks).
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
You need to log in
before you can comment on or make changes to this bug.
Description
•