Closed Bug 1168812 Opened 9 years ago Closed 9 years ago

Integration trees closed since build times for Windows exploded across trees

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

task
Not set
blocker

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: cbook, Assigned: q)

References

Details

Attachments

(1 file)

3.54 KB, text/x-ms-regedit
Details
For example, https://treeherder.mozilla.org/logviewer.html#?job_id=10185259&repo=mozilla-inbound took 305 minutes for a Windows XP opt build and then timed out, and that's across trees.

Integration trees closed
Depends on: 1167475
Assignee: nobody → q
Closed all trees now since aurora etc. are running into the same problem.
Having to do this update via phone tether until AT&T finishes replacing my hardware; expect a step-by-step to follow. After several false starts due to slaves with full disks and some other red herrings, research was done during a session with the sheriff on duty and RelEng. During that session several jobs came unstuck, and while researching those the idea of load and scaling tuning came up. We decided to try changing the auto-tuning level from experimental to normal, which limits the size of the RWIN and WSCALE values. After this fix was applied the trees were opened and have stayed open after a few hours of green builds.
This issue is the culmination of a few other issues. It started with an issue documented in Bug 1165314, in which inter-region transfer times from EC2 instances to S3 buckets were slow. This caused tasks running in buildbot on Windows USE1 EC2 instances to time out when uploading artifacts to S3 buckets in the USW2 region. After much troubleshooting, a solution was found in a combination of TCP stack tweaking and Winsock buffer size increases. On 2015-05-16 these settings became the default for our EC2 instances. They seemed to perform very well and let us take advantage of high-bandwidth pipes (100MB – 1GB+) with higher-than-LAN latency.
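The attached tcp_param.reg carries the actual values; as a rough sketch of the kind of registry knobs involved (the specific numbers below are illustrative, not the ones we shipped):

rem Illustrative only -- the real values are in the attached tcp_param.reg.
rem Enable TCP window scaling and timestamps (RFC 1323 options).
reg add "HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters" /v Tcp1323Opts /t REG_DWORD /d 3 /f
rem Raise the default Winsock send/receive buffer sizes (512 KB here, for illustration).
reg add "HKLM\SYSTEM\CurrentControlSet\Services\AFD\Parameters" /v DefaultSendWindow /t REG_DWORD /d 524288 /f
reg add "HKLM\SYSTEM\CurrentControlSet\Services\AFD\Parameters" /v DefaultReceiveWindow /t REG_DWORD /d 524288 /f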

On 2015-05-18 the same S3 issues started showing up in the datacenter. The timing seems to be a combination of a patch that tightened the timeout for artifact uploads and an increase in latency to the S3 buckets from around 60 ms to 70 ms+. These artifact upload timeouts closed the trees, and under great rush a patch was created in Bug 1166415. That patch is a small subset of the aggressive fixes for AWS, with allowances made for the hardware in the datacenter as opposed to the Xen-based VMs in AWS. The patch fixed the upload times and performance seemed great.
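For anyone spot-checking a builder after such a patch, the active TCP globals can be dumped with standard netsh (nothing specific to our patch):

rem Shows the current autotuning level, congestion provider, DCA/NetDMA and ECN state.
netsh int tcp show global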

On 2015-05-21, in Bug 1167475, it was noted that slave connection resets were causing an old issue to resurface where objdirs experienced corruption. Basically, the slave loses contact while acting on an objdir, leaving inconsistent data in that directory; the only fix is to request a clobber on the next job on that slave. The objdir corruption appeared to happen in batches during certain periods. Once it became obvious that this was a potential widespread catastrophe waiting to happen, we started seriously looking into the issue and working on porting the full set of original AWS network fix settings over to the datacenter and testing them.

On 2015-05-26, while the issue was being investigated, TCP resets became so problematic around 2230 PST that active connections began resetting and ICMP losses became evident. The working theory is that increases in load exacerbated the issue; I personally had trouble keeping a constant RDP or VNC session running. I pushed the lightly tested full set of network settings to some test machines, the drop issues stopped immediately, and some builds completed. I pushed the settings out wider and the fix held.

The next day, 2015-05-27, the issues documented in this bug (Bug 1168812) began to show a new symptom: "hangs" during job execution leading to job timeouts. It was initially, and erroneously, believed that this was due to the mass resets the night before causing objdir corruption, which seemed supported at first by the first few sample machines exhibiting the objdir behavior. Jobs were pushed with the consent of the sheriffs around 0900 PST and we started to see hang failures around 1100 PST. The first three machines identified during this troubleshooting phase had hard disks that were full or dangerously close to full (within megabytes of full). After that was disproven as a root cause, some oddness with mercurial installs being triggered was noticed; that was also dismissed as a root cause, then fixed, and is covered in another bug.

A meeting with RelEng, RelOps, and the sheriff on duty yielded a deeper dive into the issue, where we observed that an xpcshell command was hanging in all cases and that the error "ERROR!! Request Sync Service - Failed to retrieve data from the database" appeared before many hangs. During this dive, around 1320 PST, a fuzzer job that had previously been hanging completed, and we found evidence of a few more doing the same as the pools emptied. This suggested an odd, esoteric link between load and aggressive window scaling at lower latencies. We had been using the receive window auto-tuning at the "experimental" level since the first patch; it works in AWS and hadn't been a big problem in the datacenter during the initial rollout. That setting allows RWIN and WSCALE values over 16 MB, which is great for very high bandwidth connections with latencies over 30-40 ms; however, it can be problematic on low-latency, high-bandwidth LAN connections under load, and may even cause drops. Around 1330 a change was issued to all GPO machines to set the global auto-tuning level to normal (this can be done via the command line with "netsh int tcp set global autotuninglevel=normal"). Within a few minutes more jobs dislodged. Using the fuzzer jobs as a canary, due to their small size and shorter timeout of 1800 seconds, we were able to see jobs complete. The decision to open the trees was made, and several hours later jobs are still green.
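As an aside, since full disks were one of the red herrings above, ruling that out on a sample slave only takes stock Windows tooling, along these lines:

rem Report free space and total size (in bytes) for each local disk.
wmic logicaldisk where "DriveType=3" get Caption,FreeSpace,Size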
The tech fiddly bits below are the netsh commands that are issued to get the proper network settings, in conjunction with the reg file attached to this ticket. The commands overlap a bit with the registry settings, but they allow changes to take effect immediately and make setting things like the MTU much easier, as they allow references to interface aliases instead of the interface UIDs in the registry. A quick sanity check for the aliases and MTU follows the command list.


netsh int tcp set global autotuninglevel=normal
netsh int tcp set global dca=enabled
netsh int tcp set global netdma=enabled
netsh int tcp set global congestionprovider=ctcp
netsh int tcp set global ecncapability=disabled
netsh interface tcp set heuristics disabled
netsh interface ipv4 set subinterface "Local Area Connection" mtu=1500 store=persistent
netsh interface ipv4 set subinterface "Local Area Connection 2" mtu=1500 store=persistent
Attached file tcp_param.reg
Registry settings in .reg importable format for hardware machines; should be used in combination with the netsh commands.

Things seem stable since this change last week.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard