Closed
Bug 1168812
Opened 9 years ago
Closed 9 years ago
Integration trees closed since build times for Windows exploded across trees
Categories
(Infrastructure & Operations Graveyard :: CIDuty, task)
Infrastructure & Operations Graveyard
CIDuty
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: cbook, Assigned: q)
References
Details
Attachments
(1 file)
3.54 KB,
text/x-ms-regedit
Like https://treeherder.mozilla.org/logviewer.html#?job_id=10185259&repo=mozilla-inbound — a Windows XP opt build took 305 minutes and then timed out, and that's across trees. Integration trees closed.
Comment hidden (Legacy TBPL/Treeherder Robot)
Updated•9 years ago
Assignee: nobody → q
Reporter
Comment 18•9 years ago
Closed all trees now, since aurora etc. are running into the same problem.
Comment hidden (Legacy TBPL/Treeherder Robot)
Assignee
Comment 22•9 years ago
Having to do this update via phone tether until AT&T finishes replacing my hardware; expect a step-by-step to follow. After several false starts due to slaves with full disks and some other red herrings, research was done during a session with the sheriff on duty and releng. During that session several jobs came unstuck, and while researching those the idea of load and scaling tuning came up. We decided to try changing the autotuning level from experimental to normal, which limits the size of the RWIN and WSCALE values. After this fix was applied, the trees were opened and have stayed open after a few hours of green builds.
Assignee
Comment 23•9 years ago
This issue is the culmination of a few other issues. It started with the issue documented in Bug 1165314, in which inter-region transfer times from EC2 instances to S3 buckets were slow. This caused tasks running in buildbot on Windows USE1 EC2 instances to time out when uploading artifacts to S3 buckets in the USW2 region. After much troubleshooting, a solution was found in a combination of TCP stack tweaking and WINSOCK buffer size increases. On 2015-05-16 these settings became the default for our EC2 instances. They seemed to perform very well and allowed us to take advantage of high-bandwidth pipes (100MB – 1GB+) with higher-than-LAN latency.

On 2015-05-18 the same S3 issues started showing up in the datacenter. The timing seems to be a combination of a patch that tightened the timeout for the artifact uploads and an increase in latency to the S3 buckets from around 60ms to 70ms+. These artifact upload timeouts closed the trees, and under great rush a patch was created in Bug 1166415. That patch is a small subset of the aggressive fixes for AWS, with allowances made for hardware in the datacenter as opposed to the Xen-based VMs in AWS. It fixed the upload times, and performance seemed great.

On 2015-05-21, in Bug 1167475, it was noted that slave connection resets were causing an old issue to resurface where objdirs experienced corruption: the slave loses contact while acting on an objdir, causing inconsistencies in the data in that dir. The only fix is to request a clobber on the next job on that slave. The objdir corruption appeared to happen in batches during certain periods. After it became obvious that this was a potential widespread catastrophe waiting to happen, we started seriously looking into the issue and working on porting the full set of original AWS network fix settings over to the datacenter and testing them.
On 2015-05-26 the issue was being investigated, and around 2230 PST TCP resets became so problematic that active connections began resetting and ICMP losses became evident. The working theory is that increases in load exacerbated this issue. I personally had trouble keeping a constant RDP or VNC session running. I pushed the lightly tested full set of network settings to some test machines, and the drop issues stopped immediately and some builds completed. I pushed the settings out wider, and the fixed behavior held true.

The next day, 2015-05-27, the issues documented in this bug (Bug 1168812) began to show a new symptom: "hangs" during job execution leading to job timeouts. It was initially, and erroneously, believed that this was due to the mass reset issues of the night before causing objdir corruption, which seemed to be supported at first by the first few sample machines exhibiting the objdir behavior. Jobs were pushed with the consent of the sheriffs around 0900 PST, and we started to see hang failures around 1100 PST. The first 3 machines identified during this troubleshooting phase had hard disks that were full or dangerously close to full (within megabytes of full). After that was disproven as a root cause, some oddness with mercurial installs being triggered was noticed; that was also dismissed as a root cause, then fixed, and is covered in another bug.

A meeting with RELENG, RELOPS, and the sheriff on duty yielded a deeper dive into the issue, where we observed that an xpcshell command was hanging in all cases and that an error, "ERROR!! Request Sync Service - Failed to retrieve data from the database", was appearing before many hangs. During this dive, around 1320 PST, a fuzzer job that was previously hanging completed. We found evidence of a few more doing the same as the pools emptied. This brought to mind an odd, esoteric link between load and aggressive window scaling at lower latencies.
We had been using the autotuning scaling algorithm at the "experimental" level since the first patch; it works in AWS and hadn't been a big problem in the DC during the initial rollout. That setting allows RWIN and WSCALE values corresponding to receive windows over 16 MB. This is great for very high-bandwidth connections with latencies over 30ms – 40ms; however, it can be problematic on low-latency, high-bandwidth LAN connections under load. It is possible it may even cause drops. Around 1330 PST a change was issued to all GPO machines to set the global autotuning setting to normal (this can be done via the command line with "netsh int tcp set global autotuning=normal"). Within a few minutes, more jobs dislodged. Using the fuzzer jobs as a canary, due to their small size and shorter timeout of 1800 seconds, we were able to see jobs complete. The decision to open the trees was made, and several hours later jobs are still green.
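The tension described above is the classic bandwidth-delay product trade-off: the WAN path to S3 needs a receive window far beyond the unscaled 64 KB TCP maximum, while a low-latency LAN path needs almost no scaling at all. A minimal sketch of that arithmetic, using illustrative figures from the comments (roughly 1 Gb/s pipes, ~70ms to S3 vs. ~1ms in the DC; the function names are mine, not from any tool in this bug):

```python
import math

def bdp_bytes(bandwidth_bps: float, rtt_s: float) -> float:
    """Bandwidth-delay product: bytes that must be in flight to fill the pipe."""
    return bandwidth_bps * rtt_s / 8

def min_window_scale(window_bytes: float) -> int:
    """Smallest RFC 1323 window-scale shift that can advertise window_bytes.
    The unscaled 16-bit TCP window field tops out at 65535 bytes."""
    if window_bytes <= 65535:
        return 0
    return math.ceil(math.log2(window_bytes / 65535))

# WAN case (illustrative): ~1 Gb/s to S3 at ~70 ms RTT
wan = bdp_bytes(1e9, 0.070)          # 8,750,000 bytes, roughly 8.75 MB
# LAN case (illustrative): ~1 Gb/s inside the datacenter at ~1 ms RTT
lan = bdp_bytes(1e9, 0.001)          # 125,000 bytes, roughly 122 KB

print(wan, min_window_scale(wan))    # the WAN path needs aggressive scaling
print(lan, min_window_scale(lan))    # the LAN path barely needs any
```

This is why the "experimental" autotuning level (windows above 16 MB) helped the EC2-to-S3 uploads but bought nothing on the LAN, where under load the oversized windows apparently did more harm than good.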
Assignee
Comment 24•9 years ago
The tech fiddly bits below are the netsh commands issued to get the proper network settings, in conjunction with the reg file attached to this ticket. The commands overlap somewhat with the registry settings, but they allow changes to take effect immediately and make setting things like the MTU much easier, as they allow references to interface aliases instead of the interface UIDs in the registry.

netsh int tcp set global autotuning=normal
netsh int tcp set global dca=enabled
netsh int tcp set global netdma=enabled
netsh int tcp set global congestionprovider=ctcp
netsh int tcp set global ecncapability=disabled
netsh interface tcp set heuristics disabled
netsh interface ipv4 set subinterface "Local Area Connection" mtu=1500 store=persistent
netsh interface ipv4 set subinterface "Local Area Connection 2" mtu=1500 store=persistent
Assignee
Comment 25•9 years ago
Registry settings, in .reg importable format, for hardware machines; these should be used in combination with the netsh commands:

netsh int tcp set global autotuning=normal
netsh int tcp set global dca=enabled
netsh int tcp set global netdma=enabled
netsh int tcp set global congestionprovider=ctcp
netsh int tcp set global ecncapability=disabled
netsh interface tcp set heuristics disabled
netsh interface ipv4 set subinterface "Local Area Connection" mtu=1500 store=persistent
netsh interface ipv4 set subinterface "Local Area Connection 2" mtu=1500 store=persistent
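The attached .reg file itself is not reproduced in the bug, so its exact contents are unknown. For readers unfamiliar with the format, a hypothetical fragment in the same `text/x-ms-regedit` style is sketched below; the key and value names are real, documented Windows Tcpip parameters, but the specific values are illustrative assumptions, not the attachment's contents:

```
Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters]
; Illustrative only: enable RFC 1323 window scaling and timestamps
"Tcp1323Opts"=dword:00000003
; Illustrative only: cap the global receive window at 16 MB (0x01000000)
"GlobalMaxTcpWindowSize"=dword:01000000
```

Unlike the netsh commands, registry changes of this kind generally require a reboot (or at least a TCP stack restart) to take effect, which is why the comment above pairs the two approaches.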
Comment 26•9 years ago
Things seem stable since this change last week.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Updated•6 years ago
Product: Release Engineering → Infrastructure & Operations
Updated•4 years ago
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard