Closed Bug 1312383 Opened 8 years ago Closed 8 years ago

taskcluster windows 7 ec2 instances underperform

Categories: Release Engineering :: General, defect
Hardware: x86, Windows 7
Severity: normal
Status: RESOLVED FIXED
People: (Reporter: grenade, Assigned: grenade)
Attachments: (1 file)

cppunit tests on tc win 7 take longer than they could/should to complete.

timings from https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&revision=22e62e6b9783b9ff92a37b4318421b15cb411b1b:

tc win 7 opt: 18 minutes
tc win 7 dbg: 18 minutes

bb win 7 opt: 8 minutes
bb win 7 dbg: 9 minutes

tc win 10 opt: 7 minutes
tc win 10 dbg: 7 minutes

also, when running checkout based tests, tc win 7 fails to complete the hg checkout due to a timeout. tc win 10 performs the checkout quickly. both use ephemeral disks for both the checkout and robustcheckout shared folders. examples at:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=a7faf377dfbdd56be037f2efe35ca35587a099c4&group_state=expanded&exclusion_profile=false&filter-tier=2

tc win 7 log: https://public-artifacts.taskcluster.net/axL3dLhrRCerYhux3oIcZQ/0/public/logs/live_backing.log
tc win 10 log: https://public-artifacts.taskcluster.net/Wqd3tgCHRkquFWXpqZjzvQ/0/public/logs/live_backing.log
worth noting that the current tc win 7 ami is based on releng ami-60167577 (use1) which was created on 2016-08-24.
today i'm rebasing on releng ami-94790083 which was created on 2016-09-20 and is the current base for golden (https://github.com/mozilla-releng/build-cloud-tools/blob/cffc9e1170a6325cd2f10b750f66907abbc15cde/configs/t-w732#L6)
Assignee: nobody → rthijssen
Status: NEW → ASSIGNED
things i tried which did not affect performance (time taken for the cppunit test to complete):
- correct the PythonPath and InstallPath registry settings: https://github.com/mozilla-releng/OpenCloudConfig/commit/f00e2f4cfc0019d73d8614f764fc549c71ee87eb
- use python 2.7.3 (an exact copy of c:\mozilla-build\python27 from buildbot win7 slaves + virtualenv/virtualenv_support from mozilla-build 2.2.0): https://github.com/mozilla-releng/OpenCloudConfig/commit/1ab2f74e73fbb7a32ae8eabb795938f08f8754dc
- remove the pagefile and ram drive (r:) - resulted in no difference in performance (neither better nor worse)
- ngen executeQueuedItems (https://github.com/mozilla-releng/OpenCloudConfig/commit/8599dc41f1672a0531f39c789c20de4a4120871d)

things to try next (it's looking more and more like the .net upgrade causes the slowdown):
- hand build the ami (no occ, no .net 4.5, no powershell upgrade), just an exact copy of bb win7 with runner and buildbot removed and generic worker installed
Thanks for all your hard work on this Rob!
i've now ruled out the .net upgrade/dsc. latest amis are hand built exact copies of buildbot win 7 slaves with the only modification being removal of runner/buildbot and installation of the generic worker.

the only places we could be losing performance now are either in the generic worker itself or in some hack to the cltbld user that somehow made it fast (which seems unlikely).
given that we are losing most of the performance in downloading builds/packages, how can we narrow that down? It does seem odd that generic-worker could be causing this, unless something really odd is going on.

We do create users dynamically - I wonder if there is some quirk in windows 7 specifically, where a newly created user gets some flag that makes it act differently.

Looking at :grenade's recent push, I see this log:
https://public-artifacts.taskcluster.net/DvCA9fJdQSWNuZemCVh_HQ/0/public/logs/live_backing.log

I see a few instances of timeout/sleep/retry:
01:58:56     INFO -  c:\mozilla-build\python\python.exe: can't open file 'C:\mozilla-build\tooltool.py': [Errno 2] No such file or directory
01:58:56    ERROR - Return code: 2
01:58:56     INFO - retry: Failed, sleeping 240 seconds before retrying
02:02:56     INFO - retry: Calling run_command with args: (['c:\\mozilla-build\\python\\python.exe', 'C:\\mozilla-build\\tooltool.py', '--url', 


this is the other area of longer delays:
01:52:03     INFO -  Processing z:\task_1477703515\build\tests\marionette\client
01:52:05     INFO -  Processing z:\task_1477703515\build\tests\marionette\marionette\runner\mixins\browsermob-proxy-py
01:52:09     INFO -  Processing z:\task_1477703515\build\tests\marionette
01:52:56     INFO -  Building wheels for collected packages: browsermob-proxy


and 8 seconds to download 818k:
01:50:30     INFO -    Downloading http://pypi.pub.build.mozilla.org/pub/mock-1.0.1.tar.gz (818kB)
01:50:38     INFO -  Building wheels for collected packages: mock

and 15 seconds to install some python modules:
01:47:26     INFO -  Successfully built blobuploader requests docopt
01:47:26     INFO -  Installing collected packages: requests, docopt, blobuploader
01:47:43     INFO -  Successfully installed blobuploader-1.2.4 docopt-0.6.1 requests-1.2.3
01:47:44     INFO - Return code: 0

of course the download of tests.zip is 1+ minutes here:
01:42:51     INFO - x-amz-meta-url: https://s3-us-west-2.amazonaws.com/taskcluster-public-artifacts/eTf1lMx_R5GulM6kwfj12w/0/public/build/firefox-52.0a1.en-US.win32.common.tests.zip
01:42:51     INFO - x-amz-request-id: D7589D676E109906
01:42:51     INFO - Expected file size: 17133438
01:42:51     INFO - Obtained file size: 17133438
01:44:04     INFO - Downloading and extracting to Z:\task_1477703515\build\tests these dirs bin/*, certs/*, config/*, mach, marionette/*, modules/*, mozbase/*, tools/*, cppunittest/* from https://queue.taskcluster.net/v1/task/eTf1lMx_R5GulM6kwfj12w/artifacts/public/build/firefox-52.0a1.en-US.win32.cppunittest.tests.zip
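as a back-of-the-envelope check (assuming the 73 s gap between the 01:42:51 headers and the 01:44:04 line is dominated by the transfer itself rather than extraction), the common.tests.zip download rate works out to roughly 230 KiB/s:

```python
# rough throughput implied by the log timestamps above; this assumes the
# gap is mostly download time, not extraction or other work
size_bytes = 17133438              # "Expected file size" from the log
elapsed_s = 73                     # 01:42:51 -> 01:44:04
rate_kib_s = size_bytes / elapsed_s / 1024
print(round(rate_kib_s))           # ~229 KiB/s
```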



the trend is really the high overhead in installing python modules (download and installation). I really wish we could reduce that in general, as it normally accounts for about 2 minutes of overhead in each job; here it seems to account for much more of the time (in the 5+ minute range)
the tooltool error is a red herring and was fixed in a subsequent ami.

i am aware that there is a performance penalty to all that downloading and i am working to mitigate that by sharing the local pypi cache (between subsequent task users on the same host).
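for illustration, the shape of that mitigation is a shared pip cache directory configured for every task user (the y: path here is hypothetical, not the actual setting in occ):

```
; pip.ini, dropped into each task user's %APPDATA%\pip
[global]
cache-dir = y:\pip-cache
```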

however, windows 10, which is spun up in the same availability zones, performs the same cppunit test in consistently less time (6-7 min) than buildbot slaves (7-8 min). windows 7 takes 17 - 19 min consistently. this points to something other than the time taken to download since win 10 workers have to perform the same downloads.

since we know that we have absolute parity with the buildbot ami which doesn't underperform, we can also rule out network or io drivers. the only thing left to look at is the process spawning mechanism on the 32 bit version of the generic worker (used only by tc win 7 and no other worker types).

i'm leaning towards some glitch or difference in the way that the operating system reacts to the g-w mechanism for spawning the task process, which should be easy to demonstrate by running the cppunit test outside of generic-worker (manually from a cmd prompt) and comparing the time it takes to complete with the same test run from inside g-w.

updates to follow...
well, it's not generic worker either. cppunit took 28 minutes to complete when run outside of g-w.
i've discounted the multiple calls to proxxy and pypi.pvt hosts with this change to mozharness configs (which removed noise from the logs but didn't make for a faster test run): 

 
+    "proxxy": {},
     "find_links": [
-        "http://pypi.pvt.build.mozilla.org/pub",
         "http://pypi.pub.build.mozilla.org/pub",
     ],
pmoore and i did some debugging of a long running process under g-w. a task was created with a single long running command:

hg robustcheckout --sharebase y:\\hg-shared --purge --upstream https://hg.mozilla.org/mozilla-unified --revision 916cbaf21a63da2f7953933481a502d7db9b39a0 https://hg.mozilla.org/integration/mozilla-inbound/ .\\build\\src

the same command (with altered paths) run manually from a cmd prompt, completes in 10 ~ 15 minutes. under g-w it times out incomplete after 3 hours.

using process monitor and process explorer we ascertained that the process running under g-w has a low process priority (the lowest possible) while manual processes have a normal process priority.

overriding the process priority (to increase it) of the g-w process allowed the task to complete in a similar time to the manual process.
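(for reference, the same override can be done from a cmd prompt rather than from process explorer; this is illustrative of the technique, not necessarily the exact command we ran:)

```
:: raise all hg.exe processes to ABOVE_NORMAL_PRIORITY_CLASS (0x8000 = 32768)
wmic process where name="hg.exe" call setpriority 32768
```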

pmoore is exploring ways to increase the process priority of the g-w process.
Assignee: rthijssen → nobody
Status: ASSIGNED → NEW
i tried bumping up the process priority for the generic-worker process
https://github.com/mozilla-releng/OpenCloudConfig/commit/f260dedd95dd0c5fef0ec04118423ce8e7210536

i tried both "Real Time" (0) and "Above Normal" (3). it didn't seem to make any difference. my guess is that the generic-worker process is running with the new priority but that the processes it spawns (the builds/tests) are not. https://papertrailapp.com/groups/2488493/events?q=%22process+priority+for+generic+worker%22

i also read (http://stackoverflow.com/a/1664006/68115) that in order to increase a process priority, the SeIncreaseBasePriorityPrivilege permission is required, so i've added that to the permissions granted to the g-w user (https://github.com/mozilla-releng/OpenCloudConfig/commit/33ed93c39cae94bd972ff4157247c0d9c16ee632). i'm hoping that helps it to spawn processes with a higher priority.
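(for the record, an equivalent way to grant that privilege is a security template applied with secedit; the account name here is illustrative:)

```
; grant.inf - apply with: secedit /configure /db grant.sdb /cfg grant.inf
[Unicode]
Unicode=yes
[Version]
signature="$CHICAGO$"
Revision=1
[Privilege Rights]
SeIncreaseBasePriorityPrivilege = GenericWorker
```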
https://books.google.co.uk/books?id=WZdCAwAAQBAJ&pg=PT506&lpg=PT506&dq=%22Image+File+Execution+Options%22+PerfOptions+Io+Priority+Cpu+Priority&source=bl&ots=nilSGhacsd&sig=x-Q6uDIa3tDquumLyNQZ13pkAtE&hl=en&sa=X&ved=0ahUKEwjKkvCXlrrQAhVCDMAKHbEXCHEQ6AEILDAC#v=onepage&q=%22Image%20File%20Execution%20Options%22%20PerfOptions%20Io%20Priority%20Cpu%20Priority&f=false

this book states "The process's default base-scheduling priority (which starts as Normal, or 8, unless the parent process was set to Idle or Below Normal, in which case the setting is inherited)."

which suggests that until we get the scheduled task starting on a priority higher than "Below Normal", we won't get higher priorities for spawned processes.
i found a workaround! i've applied a registry hack which ensures that hg.exe and python.exe always run in the "Above Normal" process priority class.

cppunit test now completes in 3 minutes. it was 18 minutes on ephemeral and 12 on ebs volumes previously.

https://github.com/mozilla-releng/OpenCloudConfig/commit/60920c08b0fe314e5526985a588e3f83918145e0
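for anyone landing here later, the hack takes this general shape - an Image File Execution Options PerfOptions key per executable, where CpuPriorityClass 6 is the PROCESS_PRIORITY_CLASS value for above normal (see the occ commit above for the exact settings that landed):

```
Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Image File Execution Options\hg.exe\PerfOptions]
"CpuPriorityClass"=dword:00000006

[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Image File Execution Options\python.exe\PerfOptions]
"CpuPriorityClass"=dword:00000006
```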
Assignee: nobody → rthijssen
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Commits pushed to master at https://github.com/taskcluster/generic-worker

https://github.com/taskcluster/generic-worker/commit/e2fa6601905f0003e8c7699fd62e2097b0bcfae5
bug 1312383 - set scheduled task process priority to realtime

not sure it will help, but the current scheduled task runs its process with priority 7 (below normal) while priority 0 (real time) is the highest, 10 (idle) is the lowest.
https://msdn.microsoft.com/en-us/library/windows/desktop/aa383512(v=vs.85).aspx
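the counter-intuitive part of that scale is that it's inverted - a lower number means a higher priority - e.g.:

```python
# scheduled-task priority is an integer 0..10 where LOWER means HIGHER
# priority: 0 = real time (highest), 7 = below normal (the default we
# observed the task running with), 10 = idle (lowest)
def wins(a, b):
    """return the scheduled-task priority value that takes precedence."""
    return min(a, b)

print(wins(7, 0))  # 0 (real time beats the below-normal default)
```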

https://github.com/taskcluster/generic-worker/commit/67b96f57ade0838da11bebf322d3d226c9cf60a2
bug 1312383 - set scheduled task process priority to above normal

testing with realtime yielded build times the same as before (no improvement). setting to above normal in the hope it provokes the os into doing something

https://github.com/taskcluster/generic-worker/commit/ddd5ebb19645ffb971aa3687e5772e61389109a4
Merge pull request #30 from grenade/patch-5

bug 1312383 - set scheduled task process priority to above normal
Attachment #8812741 - Flags: review?(pmoore) → review+
Component: General Automation → General