Bug 910818 (Closed) — Opened 11 years ago, Closed 11 years ago

Please investigate cause of network disconnects 2013-08-29 10:22-10:24 Pacific

Categories: Infrastructure & Operations Graveyard :: NetOps
Hardware/OS: x86, Windows Server 2003
Type: task
Priority: Not set
Severity: normal
Tracking: Not tracked
Status: RESOLVED WORKSFORME
People: Reporter: jhopkins; Assignee: cransom
Attachments: 1 file

We experienced numerous build slave disconnects (on existing network connections established before the outage) on 2013-08-29 between 10:22 and 10:24.  This caused many builds to fail and have to restart.  Can you please investigate the cause?

log: https://tbpl.mozilla.org/php/getParsedLog.php?id=27182526&tree=Mozilla-Inbound
source: w64-ix-slave117.winbuild.scl1.mozilla.com (10.12.40.149)
dest: buildbot-master63.srv.releng.use1.mozilla.com:9001 (10.134.48.196)
time: 2013-08-29 10:24:39.915172

log: https://tbpl.mozilla.org/php/getParsedLog.php?id=27182405&tree=Mozilla-Inbound
source: w64-ix-slave78.winbuild.scl1.mozilla.com (10.12.40.108)
dest: buildbot-master63.srv.releng.use1.mozilla.com:9001 (10.134.48.196)
time: 2013-08-29 10:22:32.052494

log: https://tbpl.mozilla.org/php/getParsedLog.php?id=27182419&tree=Mozilla-Inbound
source: w64-ix-slave123.winbuild.scl1.mozilla.com (10.12.40.155)
dest: buildbot-master63.srv.releng.use1.mozilla.com:9001 (10.134.48.196)
time: 2013-08-29 10:22:38.034932

log: https://tbpl.mozilla.org/php/getParsedLog.php?id=27182531&tree=Mozilla-Inbound
source: w64-ix-slave107.winbuild.scl1.mozilla.com (10.12.40.139)
dest: buildbot-master63.srv.releng.use1.mozilla.com:9001 (10.134.48.196)
time: 2013-08-29 10:24:45.846917

log: https://tbpl.mozilla.org/php/getParsedLog.php?id=27182399&tree=Mozilla-Inbound
source: w64-ix-slave116.winbuild.scl1.mozilla.com (10.12.40.148)
dest: buildbot-master66.srv.releng.usw2.mozilla.com:9001 (10.132.50.247)
time: 2013-08-29 10:22:33.478881

log: https://tbpl.mozilla.org/php/getParsedLog.php?id=27182384&tree=Mozilla-Inbound
source: w64-ix-slave120.winbuild.scl1.mozilla.com (10.12.40.152)
dest: buildbot-master66.srv.releng.usw2.mozilla.com:9001 (10.132.50.247)
time: 2013-08-29 10:22:26.527321

log: https://tbpl.mozilla.org/php/getParsedLog.php?id=27182407&tree=Mozilla-Inbound
source: w64-ix-slave108.winbuild.scl1.mozilla.com (10.12.40.140)
dest: buildbot-master58.srv.releng.usw2.mozilla.com:9001 (10.132.49.125)
time: 2013-08-29 10:22:35.332269

log: https://tbpl.mozilla.org/php/getParsedLog.php?id=27182398&tree=B2g-Inbound
source: w64-ix-slave118.winbuild.scl1.mozilla.com (10.12.40.150)
dest: buildbot-master65.srv.releng.usw2.mozilla.com:9001 (10.132.49.112)
time: 2013-08-29 10:22:38.532237

log: https://tbpl.mozilla.org/php/getParsedLog.php?id=27182515&tree=B2g-Inbound
source: w64-ix-slave111.winbuild.scl1.mozilla.com (10.12.40.143)
dest: buildbot-master61.srv.releng.use1.mozilla.com:9001 (10.134.49.62)
time: 2013-08-29 10:24:33.499145

log: https://tbpl.mozilla.org/php/getParsedLog.php?id=27182412&tree=B2g-Inbound
source: w64-ix-slave151.winbuild.scl1.mozilla.com (10.12.40.183)
dest: buildbot-master62.srv.releng.use1.mozilla.com:9001 (10.134.48.236)
time: 2013-08-29 10:22:34.410747

I noticed some builds completing successfully on two of the buildbot masters above. In those cases the build slaves were on Amazon, whereas all the failing slaves above are in scl1.

 http://buildbot-master63.srv.releng.use1.mozilla.com:8001/builders/b2g_b2g-inbound_emulator_dep/builds/64
 http://buildbot-master58.srv.releng.usw2.mozilla.com:8001/builders/Android%20no-ionmonkey%20mozilla-inbound%20build/builds/577
 http://buildbot-master58.srv.releng.usw2.mozilla.com:8001/builders/Android%20Debug%20mozilla-inbound%20build/builds/552
Notes:
 - all times are PDT
 - comment 0 may not list all hosts affected by the event; these are simply the ones we noticed.
Assignee: network-operations → cransom
I saw no events on the Mozilla network and no VPN failures in scl1 or scl3. Your failure window was 2 minutes, which is shorter than our default timers for BGP failover of the VPN tunnel, so it's possible there was internet churn for those 2 minutes or that Amazon's VPC VPN endpoints had problems.  If you can make a target available (a static host that responds to ping) in each region or AZ, we can add it to our smokeping instance and get a better idea of the failure radius for future issues.  Please keep in mind that Amazon provides no SLA for VPC connectivity (VPN or their Direct Connect service), so slaves in scl1/scl3 connecting to masters in the cloud will likely be less stable than connections to masters housed in Mozilla infrastructure.
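
Until such targets exist, a rough stand-in for gauging the failure radius could be a small probe run from a slave in scl1 that pings one static host per region and logs packet loss. A minimal sketch only, assuming Linux ping and hypothetical per-region target hostnames (placeholders, not real RelEng hosts):

#!/usr/bin/env python
# Crude connectivity probe: ping one static target per AWS region/AZ once a
# minute and log packet loss, so a short blip leaves a trace to correlate.
# The hostnames below are hypothetical placeholders.
import subprocess
import time

TARGETS = {
    "use1": "ping-target.use1.example.com",  # hypothetical
    "usw2": "ping-target.usw2.example.com",  # hypothetical
}

def lost_packets(host, count=5):
    # Send `count` echo requests (Linux ping) and return how many got no reply.
    proc = subprocess.Popen(["ping", "-c", str(count), "-W", "2", host],
                            stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    out = proc.communicate()[0].decode("utf-8", "replace")
    received = sum(1 for line in out.splitlines() if "bytes from" in line)
    return count - received

while True:
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    for region, host in sorted(TARGETS.items()):
        print("%s %s (%s): lost %d/5" % (stamp, region, host, lost_packets(host, 5)))
    time.sleep(60)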
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → WORKSFORME
Thanks - we may have those hosts available for smokeping -- see bug 896812 comment 18. I'll put a note there, so :ashish can discuss with you when he's back.
So far we know this has happened for Win64 build machines in SCL1.

Have we seen it happening on other platforms?
Were any of the affected machines in SCL3 or MTV1?

I want to collect more info about this problem and see what we can do about it.

It would also be great if we changed the way buildbot works to minimize the number of long-lived network connections it keeps open.

Do we know if it was a DNS issue or a dropped TCP connection?
DNS won't cause disconnects - the lookup happens before the connection is established.
Attached image scl1-uplinks.png
Attached are the uplink graphs for scl1.  We are well below the maximum throughput capability of the firewall and a long way from saturating the network uplinks.  This is a pretty good graph of what network saturation/congestion doesn't look like.  One thing you can monitor if you want to detect network back pressure is the TCP stack stats on the sender and receiver.  If you see heavily incrementing TCP retransmits (in the standard SNMP MIBs, .1.3.6.1.2.1.6.12 or tcpRetransSegs), that indicates congestion and packet loss.
> One thing you can monitor if you want to detect network back pressure is
> the TCP stack stats on the sender and receiver.  If you see heavily incrementing
> TCP retransmits (in the standard SNMP MIBs, .1.3.6.1.2.1.6.12 or tcpRetransSegs),
> that indicates congestion and packet loss

Do you know by chance how we can do this on Windows? Or even on Linux, so we could read up on the topic? I'm clueless.

From IRC:
11:54 dustin: the "connection lost" is usually in response to ECONNRESET from the OS socket layer
11:54 Callek: armenzg_lunch: we're talking windows, as far back as XP, all bets are off :-p
11:54 dustin: but that only happens when the socket layer figures out that something is wrong
11:55 dustin: a read() can hang forever waiting for an incoming packet from a slave, and if that packet never arrives, no error is generated (on the side doing the read())
12:55 armenzg: so Windows so far?
12:56 armenzg: dustin: does that mean that Windows' socket layer is more fragile?
12:57 dustin: yes, or more specifically has different timeouts, behaviors, etc.
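
One mitigation for the "read() can hang forever" case, independent of buildbot's own slave keepalive option, is to enable TCP keepalives on the master/slave sockets so the kernel eventually surfaces an error for a silently dead peer. The sketch below only illustrates the raw socket idea, using Linux-specific tuning knobs; Windows tunes the equivalent via SIO_KEEPALIVE_VALS:

import socket

def enable_keepalive(sock, idle=60, interval=10, probes=6):
    # Probe an idle connection so a dead peer eventually produces an error
    # instead of a read() that blocks forever.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # TCP_KEEPIDLE/KEEPINTVL/KEEPCNT are Linux-specific; guard for other OSes.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)

# With idle=60, interval=10, probes=6, a dead peer is noticed after roughly
# 60 + 6*10 = 120 seconds instead of never. Hypothetical usage:
#   sock = socket.create_connection((master_host, 9001))
#   enable_keepalive(sock)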
It took me a bit to find a Windows machine I could stick the SNMP agent on, as all of our Windows ninjas are out; however, I found an XP VM and enabled SNMP.

0ogre:~% snmpwalk -v2c -c public 198.18.1.48 tcpRetransSegs.0
TCP-MIB::tcpRetransSegs.0 = Counter32: 0

The same method works for Linux. There are other, more useful methods, like collectd plugins, but SNMP should be universal.  Just take note: the only way the counter stays at 0 is if the machine is completely idle. TCP retransmissions are common, and there are a great many reasons network traffic may need to be retransmitted. Small increments are normal; large ones (>1000 in a minute) would be worth looking into, but it depends on the environment.
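
To turn that rule of thumb into monitoring, something like the sketch below could watch the counter and flag big jumps. It reads /proc/net/snmp directly, so it is Linux-only; on Windows you would poll tcpRetransSegs over SNMP as shown above, or parse netstat -s. The >1000/minute threshold is just the heuristic from this comment:

#!/usr/bin/env python
# Watch the TCP retransmit counter and warn on large per-minute jumps.
# Linux-only: parses /proc/net/snmp. The threshold is a rough heuristic.
import time

def read_retrans_segs():
    # /proc/net/snmp has two "Tcp:" lines: a header row and a counter row.
    with open("/proc/net/snmp") as f:
        tcp_lines = [line.split() for line in f if line.startswith("Tcp:")]
    header, values = tcp_lines[0], tcp_lines[1]
    return int(values[header.index("RetransSegs")])

THRESHOLD = 1000  # retransmitted segments per minute

previous = read_retrans_segs()
while True:
    time.sleep(60)
    current = read_retrans_segs()
    delta, previous = current - previous, current
    status = "WARN" if delta > THRESHOLD else "ok"
    print("%s retransmits in last minute: %d [%s]"
          % (time.strftime("%H:%M:%S"), delta, status))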
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard