Closed Bug 1307204 Opened 8 years ago Closed 8 years ago

Convert gecko-L-b-win2012 workers to c4.4xlarge

Categories

(Taskcluster :: General, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: gps, Assigned: grenade)

References

(Blocks 1 open bug)

Details

Attachments

(1 file)

gecko-L-b-win2012 workers are currently running on c3.2xlarge instances. Linux and Mac builders are all running c4.4xlarge or m4.4xlarge instances. Converting the Linux and Mac builders resulted in a significant build time speedup due to access to more CPU cores.

Currently, Windows build tasks take a little more than an hour. Compare with Linux and OS X, which are ~30 minutes. Windows is by far the long pole.

While there are significant portions of the build task where not all CPUs are active, I think throwing beefier EC2 instances at the Windows build will shave several minutes off builds and end-to-end times.

Note that switching from c3 to c4 will require transitioning to EBS storage. This may help mitigate bug 1305174.
garndt: this bug has a drastic impact on end-to-end times for Windows builds. In theory, we could update the worker via AWS Provisioner. However, I think there will be implications with the switch from c3 ephemeral volumes to EBS. So it is probably best if a TC person handles it. Could you please triage/prioritize this bug?
Flags: needinfo?(garndt)
Since these are the worker types I think Rob handles, I will loop him in.
Flags: needinfo?(garndt) → needinfo?(rthijssen)
https://treeherder.mozilla.org/perf.html#/graphs?series=%5Bautoland,a3ef1745f26c90df06bd9cee74c0f0b725cd7cdc,1,2%5D says our times to perform just compilation are hovering around 30 minutes.

For reference, my i7-6700K can do that in under 20 minutes. So throwing more and faster cores at the problem should yield significant results. I wouldn't be surprised if we could shave 20 minutes off the compile tier time by upgrading the instance type to a c4.4xlarge or c4.8xlarge.
Assignee: nobody → rthijssen
Status: NEW → ASSIGNED
Flags: needinfo?(rthijssen)
FWIW, a c4.8xlarge with a gp2 EBS volume at 600 IOPS can do a `mach build` in ~14 minutes wall time. 10-11 minutes of that is the compile tier. Cores are pretty hot for most of the compile tier except the last ~2 minutes, where it waits on a SpiderMonkey source file and xul.dll linking. We could probably get `mach build` wall time down to 10-12 minutes with some build system tweaking.
Do you have an estimate of the change in cost-per-build (at least a ratio)?
Here are current spot prices in use1.

Current Linux builders are using c4.4xlarge @ $0.15/hr.
Current Windows builders are using c3.2xlarge @ $0.40/hr.

Windows c4.2xlarge is $0.40/hr, c4.4xlarge is $0.80/hr, and c4.8xlarge is $1.60/hr.

Linux c4.8xlarge is $0.27/hr.

Windows instances are expensive :/ And TBH, a c4.8xlarge will waste most of its cores for most of a build task since builds not only compile, they create test packages, build symbols, run `make check`, run Python unit tests, etc. We can probably justify c4.8xlarge on Linux to buy a handful of minutes for insanely fast builds on that platform. But the cost for Windows may be prohibitive.

As far as this bug goes, moving from c3.2xlarge to c4.2xlarge would be a win, as the c4's have faster CPUs than c3's.
Thanks! I'm not opposed to paying more to save time, but there's a balance point at which the cost outweighs the benefit, so it's good to at least estimate the costs!
Comment on attachment 8803405 [details]
Bug 1307204 - use c: drive for build and hg cache;

https://reviewboard.mozilla.org/r/87682/#review86654

c:\ on c4's is a root EBS volume and I/O will be slow per bug 1305174. We should be explicitly mounting a separate EBS volume on these worker types and perform all performance sensitive work from that mounted drive.
Attachment #8803405 - Flags: review?(gps) → review-
Depends on: 1315273
this is now implemented (new instances are c4.4xlarge): https://github.com/mozilla-releng/OpenCloudConfig/commit/1d41a8f4a6e0691f551fed732b19b18960def3df

watching for build failures...
For archaeology's sake, this was reverted on Friday because things broke. I believe grenade is still working on rolling this out.
g-w is shutting down the instances as soon as g-w starts (excerpt from pt logs):

Nov 14 07:01:37 i-06272f1dc5bf6e4e9.gecko-1-b-win2012-beta.euc1.mozilla.com OpenCloudConfig:  generic-worker installation detected. 
Nov 14 07:01:37 i-06272f1dc5bf6e4e9.gecko-1-b-win2012-beta.euc1.mozilla.com OpenCloudConfig:  waiting for generic-worker process to start. 
Nov 14 07:01:37 i-06272f1dc5bf6e4e9.gecko-1-b-win2012-beta.euc1.mozilla.com OpenCloudConfig:  generic-worker running process detected 184 ms after task-claim-state.valid flag set. 
Nov 14 07:01:37 i-06272f1dc5bf6e4e9.gecko-1-b-win2012-beta.euc1.mozilla.com OpenCloudConfig:  userdata run completed 
Nov 14 07:02:18 i-06272f1dc5bf6e4e9.gecko-1-b-win2012-beta.euc1.mozilla.com User32:  The process C:\Windows\System32\shutdown.exe (I-06272F1DC5BF6) has initiated the shutdown of computer I-06272F1DC5BF6 on behalf of user I-06272F1DC5BF6\GenericWorker for the following reason: No title for this reason could be found   Reason Code: 0x800000ff   Shutdown Type: shutdown   Comment:  

hopefully this pr will give the logging system time to tell us what's going on before the instance is terminated: https://github.com/taskcluster/generic-worker/pull/29
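As a rough illustration of how the excerpt above can be read, here is a minimal Python sketch that parses two of the papertrail lines and measures how long the instance stayed up after the userdata run completed. The regex, the helper names (`parse_line`, `seconds_until_shutdown`) and the placeholder year are assumptions made for this example; they are not part of OpenCloudConfig or generic-worker, and the long shutdown message is truncated in the sample data.

```python
import re
from datetime import datetime

# Two lines copied from the excerpt above; the shutdown message is truncated.
LOG_LINES = [
    "Nov 14 07:01:37 i-06272f1dc5bf6e4e9.gecko-1-b-win2012-beta.euc1.mozilla.com "
    "OpenCloudConfig:  userdata run completed ",
    "Nov 14 07:02:18 i-06272f1dc5bf6e4e9.gecko-1-b-win2012-beta.euc1.mozilla.com "
    "User32:  The process C:\\Windows\\System32\\shutdown.exe (I-06272F1DC5BF6) "
    "has initiated the shutdown of computer I-06272F1DC5BF6",
]

# Assumed layout: "<Mon> <day> <HH:MM:SS> <host> <program>:  <message>"
LINE_RE = re.compile(
    r"^(?P<ts>\w{3} +\d+ \d{2}:\d{2}:\d{2}) (?P<host>\S+) "
    r"(?P<program>[^:]+): +(?P<message>.*)$"
)


def parse_line(line, year=2000):
    """Return (timestamp, program, message), or None if the line does not match.

    The log lines carry no year, so an arbitrary placeholder year is used;
    only differences between timestamps are meaningful.
    """
    match = LINE_RE.match(line)
    if match is None:
        return None
    timestamp = datetime.strptime(f"{year} {match['ts']}", "%Y %b %d %H:%M:%S")
    return timestamp, match["program"], match["message"].strip()


def seconds_until_shutdown(lines):
    """Seconds between 'userdata run completed' and the shutdown.exe event."""
    completed = None
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        timestamp, _program, message = parsed
        if "userdata run completed" in message:
            completed = timestamp
        elif "shutdown.exe" in message and completed is not None:
            return (timestamp - completed).total_seconds()
    return None


print(seconds_until_shutdown(LOG_LINES))
```

For the two sample lines this gives the gap between 07:01:37 and 07:02:18, which is the window in which g-w decided to halt without logging a reason.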
Flags: needinfo?(pmoore)
Commits pushed to master at https://github.com/taskcluster/generic-worker

https://github.com/taskcluster/generic-worker/commit/028176f5c386f9206c85e37112aac4123b74c36d
bug 1307204 - give system time to log exceptions

c4.4xlarge instances are being terminated by g-w before doing any work. papertrail [logs](https://papertrailapp.com/systems/532378943/events?focus=734832384771678217) contain only the shutdown command initiated by the g-w user and no reason for the halt (eg: "Nov 14 07:02:18 i-06272f1dc5bf6e4e9.gecko-1-b-win2012-beta.euc1.mozilla.com User32:  The process C:\Windows\System32\shutdown.exe (I-06272F1DC5BF6) has initiated the shutdown of computer I-06272F1DC5BF6 on behalf of user I-06272F1DC5BF6\GenericWorker for the following reason: No title for this reason could be found   Reason Code: 0x800000ff   Shutdown Type: shutdown   Comment:  "). by delaying the shutdown, i am hoping to capture the g-w logs before the instance is terminated.

https://github.com/taskcluster/generic-worker/commit/20dede6168af739533fcb2ce71e40426607c8e2c
Merge pull request #29 from grenade/patch-4

bug 1307204 - give system time to log exceptions
The Linux gecko builders have a single 120 GB EBS volume. Keep in mind that EBS IOPS scale with volume size. I fear 20 GB volumes are both too small and not fast enough.

We also don't need multiple EBS volumes on the Windows builders. Pretty sure that is a cargo cult from the days of ephemeral storage, where you had multiple drives.
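To make the size-versus-IOPS point concrete, here is a small Python sketch. It assumes the volumes are general-purpose SSD (gp2), whose documented baseline is 3 IOPS per GiB with a floor of 100 IOPS; the ceiling used below is the currently documented gp2 maximum and may differ from what applied when this bug was filed. The function name is made up for the example, and burst credits (which let small gp2 volumes temporarily exceed the baseline) are deliberately ignored.

```python
def gp2_baseline_iops(size_gib, iops_per_gib=3, floor=100, ceiling=16000):
    """Baseline (non-burst) IOPS of a gp2 volume, under the assumptions above."""
    return max(floor, min(ceiling, size_gib * iops_per_gib))


# 20 GB and 120 GB are the sizes discussed in this comment; 200 GB is included
# because it is the size that would yield the 600 IOPS figure quoted earlier.
for size in (20, 120, 200):
    print(f"{size:>4} GiB -> {gp2_baseline_iops(size)} baseline IOPS")
```

Under these assumptions a 20 GB volume sits on the 100 IOPS floor, while the 120 GB volume on the Linux builders gets a baseline several times higher.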
https://github.com/mozilla-releng/OpenCloudConfig/commit/e80ed01f9615e6486a2dd6c9a447a9e438340a0f

this has just kicked in (in the same form as the successful test run) for all tc win builds (try, m-i, m-c, al, date). build times look good so far. opt and debug builds are fairly uniformly at 41 ~ 45 minutes down from 70 ~ 85 minutes with a few slower builds on the date branch (not sure why) and pgo is of course still slower than the rest.

i will experiment with larger ebs volumes in the coming days. if there is a marked increase in performance, i guess it would justify the cost of the extra space (g-w deletes task directories when the task completes, so i think we only need more space if it increases performance).

the y: drive is currently being used for cache data (hg, tooltool, sccache - the latter 2 not yet working) with the idea being to separate task data on z: from cache/shared data on y: so that we can implement quick formatting of the task data drive on task completion without losing caches. that is also not implemented yet, so perhaps we won't bother with the idea and streamline to a single ebs volume.

we can have a play in the coming days as we tweak this setup.
we had a very high rate of claim-expired triggered retries (blue "B"s in treeherder) overnight due to failures to delete task data and subsequently full z: drives after a task run or two. pmoore is patching g-w to deal with the deletion issues and i have extended the z: drive size to 80gb (matching c3.2xlarge ephemeral size) to get us through to that patch.
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
i tested c4.8xlarge instances to see if there would be any performance gains but they mirrored the performance on c4.4xlarge. eg: https://treeherder.mozilla.org/#/jobs?repo=try&revision=fd56ef6283bc691da03f50ce9d4c38c55c849562&selectedJob=31120267 (the later green retriggers are on c4.8xlarge)
End to end times aren't necessarily the best indicator of performance because the first run on a worker will need to spend several minutes obtaining a Mercurial clone and large files from tooltool.

The thing that should be measured is the end-to-end time inside the build system itself, particularly in the "compile" tier. Unfortunately, it looks like our Perfherder metrics aren't working correctly on these builds, so getting those numbers is a bit difficult :/ The lack of Perfherder on TC Windows builds is definitely a bug that needs to be addressed.
- created bug 1317976 to track fixing the perf counters.
- tested c4.2xlarge with ebs rather than ephemeral drives and build times went up to 60 minutes.

since c4.8xlarge yields negligible improvement and c4.2xlarge yields considerable delay, it seems that c4.4xlarge is the sweet spot for win builds (until we change build technique to take advantage of more cores, i guess).
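For the cost-per-build ratio asked about earlier in this bug, a back-of-the-envelope Python sketch follows. The hourly prices are the use1 Windows spot prices quoted above, and the durations are rough task times reported in this thread (70 ~ 85 minutes on c3.2xlarge taken as a 77.5 minute midpoint, about 60 minutes on c4.2xlarge, 41 ~ 45 minutes on c4.4xlarge taken as 43, and c4.8xlarge assumed equal to c4.4xlarge). Spot prices fluctuate and task time includes more than the build itself, so treat the output as an estimate only; the dictionary and function names are invented for this example.

```python
# Windows spot prices in use1, USD per hour, as quoted in this bug.
SPOT_PRICE_PER_HOUR = {
    "c3.2xlarge": 0.40,
    "c4.2xlarge": 0.40,
    "c4.4xlarge": 0.80,
    "c4.8xlarge": 1.60,
}

# Approximate task durations in minutes, taken from the comments in this bug.
# Midpoints are an assumption made here, not measured averages.
BUILD_MINUTES = {
    "c3.2xlarge": 77.5,  # midpoint of 70 ~ 85
    "c4.2xlarge": 60.0,
    "c4.4xlarge": 43.0,  # midpoint of 41 ~ 45
    "c4.8xlarge": 43.0,  # reported to mirror c4.4xlarge
}


def cost_per_build(instance_type):
    """Estimated USD spent on instance time for one build task."""
    hours = BUILD_MINUTES[instance_type] / 60.0
    return SPOT_PRICE_PER_HOUR[instance_type] * hours


baseline = cost_per_build("c3.2xlarge")
for instance_type in SPOT_PRICE_PER_HOUR:
    cost = cost_per_build(instance_type)
    print(f"{instance_type}: ${cost:.2f} per build, {cost / baseline:.2f}x c3.2xlarge")
```

With these inputs, c4.4xlarge comes out slightly more expensive per build than c3.2xlarge while finishing in a bit over half the time, and c4.8xlarge roughly doubles the cost for no measured gain.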
The compile phase of the build should parallelize as far as you can throw hardware at it, but other phases of the build do not, so there are definitely diminishing returns for builds in automation. We also know that I/O slowness on Windows makes it harder to scale.
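The diminishing returns described here follow the shape of Amdahl's law: only the parallel part of the task shrinks as cores are added. The Python sketch below illustrates that shape. The vCPU counts are the published sizes of the c4 instance types, but the 320 core-minutes of parallel work and 20 minutes of serial work are hypothetical values chosen for illustration, not measurements from these builds, and the model ignores the I/O slowness mentioned above.

```python
def build_minutes(cores, parallel_core_minutes, serial_minutes):
    """Amdahl-style estimate: serial work is fixed, parallel work divides by cores."""
    return serial_minutes + parallel_core_minutes / cores


# vCPU counts for the c4 family; the workload split is hypothetical.
VCPUS = {"c4.2xlarge": 8, "c4.4xlarge": 16, "c4.8xlarge": 36}
PARALLEL_CORE_MINUTES = 320.0  # hypothetical compile-tier work
SERIAL_MINUTES = 20.0          # hypothetical packaging, symbols, checks, etc.

for instance_type, cores in VCPUS.items():
    minutes = build_minutes(cores, PARALLEL_CORE_MINUTES, SERIAL_MINUTES)
    print(f"{instance_type} ({cores} vCPUs): ~{minutes:.1f} minutes")
```

In this toy model each doubling of cores saves less wall time than the previous one, because the serial portion sets a floor that no instance size can lower.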