Bug 910818 (Closed) — Opened 11 years ago, Closed 11 years ago

Please investigate cause of network disconnects 2013-08-29 10:22-10:24 Pacific

Categories: Infrastructure & Operations Graveyard :: NetOps
Hardware/OS: x86, Windows Server 2003
Type: task
Priority: Not set
Severity: normal
Tracking: Not tracked
Status: RESOLVED WORKSFORME
People: Reporter: jhopkins; Assignee: cransom
Attachments: 1 file

We experienced numerous build slave disconnects (on existing network connections established before the outage) on 2013-08-29 between 10:22 and 10:24.  This caused many builds to fail and have to restart.  Can you please investigate the cause?

log: https://tbpl.mozilla.org/php/getParsedLog.php?id=27182526&tree=Mozilla-Inbound
source: w64-ix-slave117.winbuild.scl1.mozilla.com (10.12.40.149)
dest: buildbot-master63.srv.releng.use1.mozilla.com:9001 (10.134.48.196)
time: 2013-08-29 10:24:39.915172

log: https://tbpl.mozilla.org/php/getParsedLog.php?id=27182405&tree=Mozilla-Inbound
source: w64-ix-slave78.winbuild.scl1.mozilla.com (10.12.40.108)
dest: buildbot-master63.srv.releng.use1.mozilla.com:9001 (10.134.48.196)
time: 2013-08-29 10:22:32.052494

log: https://tbpl.mozilla.org/php/getParsedLog.php?id=27182419&tree=Mozilla-Inbound
source: w64-ix-slave123.winbuild.scl1.mozilla.com (10.12.40.155)
dest: buildbot-master63.srv.releng.use1.mozilla.com:9001 (10.134.48.196)
time: 2013-08-29 10:22:38.034932

log: https://tbpl.mozilla.org/php/getParsedLog.php?id=27182531&tree=Mozilla-Inbound
source: w64-ix-slave107.winbuild.scl1.mozilla.com (10.12.40.139)
dest: buildbot-master63.srv.releng.use1.mozilla.com:9001 (10.134.48.196)
time: 2013-08-29 10:24:45.846917

log: https://tbpl.mozilla.org/php/getParsedLog.php?id=27182399&tree=Mozilla-Inbound
source: w64-ix-slave116.winbuild.scl1.mozilla.com (10.12.40.148)
dest: buildbot-master66.srv.releng.usw2.mozilla.com:9001 (10.132.50.247)
time: 2013-08-29 10:22:33.478881

log: https://tbpl.mozilla.org/php/getParsedLog.php?id=27182384&tree=Mozilla-Inbound
source: w64-ix-slave120.winbuild.scl1.mozilla.com (10.12.40.152)
dest: buildbot-master66.srv.releng.usw2.mozilla.com:9001 (10.132.50.247)
time: 2013-08-29 10:22:26.527321

log: https://tbpl.mozilla.org/php/getParsedLog.php?id=27182407&tree=Mozilla-Inbound
source: w64-ix-slave108.winbuild.scl1.mozilla.com (10.12.40.140)
dest: buildbot-master58.srv.releng.usw2.mozilla.com:9001 (10.132.49.125)
time: 2013-08-29 10:22:35.332269

log: https://tbpl.mozilla.org/php/getParsedLog.php?id=27182398&tree=B2g-Inbound
source: w64-ix-slave118.winbuild.scl1.mozilla.com (10.12.40.150)
dest: buildbot-master65.srv.releng.usw2.mozilla.com:9001 (10.132.49.112)
time: 2013-08-29 10:22:38.532237

log: https://tbpl.mozilla.org/php/getParsedLog.php?id=27182515&tree=B2g-Inbound
source: w64-ix-slave111.winbuild.scl1.mozilla.com (10.12.40.143)
dest: buildbot-master61.srv.releng.use1.mozilla.com:9001 (10.134.49.62)
time: 2013-08-29 10:24:33.499145

log: https://tbpl.mozilla.org/php/getParsedLog.php?id=27182412&tree=B2g-Inbound
source: w64-ix-slave151.winbuild.scl1.mozilla.com (10.12.40.183)
dest: buildbot-master62.srv.releng.use1.mozilla.com:9001 (10.134.48.236)
time: 2013-08-29 10:22:34.410747

I noticed some builds completing successfully on two of the buildbot masters above. In those cases the build slaves were on Amazon, whereas all the failing slaves above are in scl1.

 http://buildbot-master63.srv.releng.use1.mozilla.com:8001/builders/b2g_b2g-inbound_emulator_dep/builds/64
 http://buildbot-master58.srv.releng.usw2.mozilla.com:8001/builders/Android%20no-ionmonkey%20mozilla-inbound%20build/builds/577
 http://buildbot-master58.srv.releng.usw2.mozilla.com:8001/builders/Android%20Debug%20mozilla-inbound%20build/builds/552
Notes:
 - all times are PDT
 - comment 0 may not list all hosts affected by the event; these are simply the ones we noticed.
Assignee: network-operations → cransom
I saw no events on the Mozilla network and no VPN failures in scl1 or scl3. Your failure window was 2 minutes, which is shorter than our default timers for BGP failover of the VPN tunnel, so it's possible there was internet churn for those 2 minutes or that Amazon's VPC VPN endpoints had problems.  If you can make a target available (a static host that responds to ping) in each region or AZ, we can add it to our smokeping instance and get a better idea of the failure radius for future issues.  Please keep in mind that Amazon provides no SLA for VPC connectivity (VPN or their Direct Connect service), so slaves in scl1/scl3 connecting to masters in the cloud will likely be less stable than connections to masters housed in Mozilla infrastructure.
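
Until such targets exist, a rough stand-in for gauging the failure radius could be a small probe run from a slave in scl1 that pings one static host per region and logs packet loss. A minimal sketch only, assuming Linux ping and hypothetical per-region target hostnames (placeholders, not real RelEng hosts):

#!/usr/bin/env python
# Crude connectivity probe: ping one static target per AWS region/AZ once a
# minute and log packet loss, so a short blip leaves a trace to correlate.
# The hostnames below are hypothetical placeholders.
import subprocess
import time

TARGETS = {
    "use1": "ping-target.use1.example.com",  # hypothetical
    "usw2": "ping-target.usw2.example.com",  # hypothetical
}

def lost_packets(host, count=5):
    # Send `count` echo requests (Linux ping) and return how many got no reply.
    proc = subprocess.Popen(["ping", "-c", str(count), "-W", "2", host],
                            stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    out = proc.communicate()[0].decode("utf-8", "replace")
    received = sum(1 for line in out.splitlines() if "bytes from" in line)
    return count - received

while True:
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    for region, host in sorted(TARGETS.items()):
        print("%s %s (%s): lost %d/5" % (stamp, region, host, lost_packets(host, 5)))
    time.sleep(60)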
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → WORKSFORME
Thanks - we may have those hosts available for smokeping -- see bug 896812 comment 18. I'll put a note there, so :ashish can discuss with you when he's back.
So far we know this has happened for Win64 build machines in SCL1.

Have we seen it happening on other platforms?
Were any of the affected machines in SCL3 or MTV1?

I want to collect more info about this problem and see what we can do about it.

It would also be great if we changed the way buildbot works to minimize the number of long-lived network connections it keeps open.

Do we know if it was a DNS issue or a dropped TCP connection?
DNS won't cause disconnects - the lookup happens before the connection is established.
Attached image scl1-uplinks.png
Attached are the uplink graphs for scl1.  We are well below the maximum throughput capability of the firewall and a long way from saturating the network uplinks.  This is a pretty good graph of what network saturation/congestion doesn't look like.  One thing you can monitor if you want to detect network back pressure is the TCP stack stats on the sender and receiver.  If you see heavily incrementing TCP retransmits (in the standard SNMP MIBs, .1.3.6.1.2.1.6.12 or tcpRetransSegs), that indicates congestion and packet loss.
> One thing you can monitor if you want to detect network back pressure is
> the TCP stack stats on the sender and receiver.  If you see heavily incrementing
> TCP retransmits (in the standard SNMP MIBs, .1.3.6.1.2.1.6.12 or tcpRetransSegs),
> that indicates congestion and packet loss

Do you know by chance how we can do this on Windows? Or even on Linux, so we could read up on the topic? I'm clueless.

From IRC:
11:54 dustin: the "connection lost" is usually in response to ECONNRESET from the OS socket layer
11:54 Callek: armenzg_lunch: we're talking windows, as far back as XP, all bets are off :-p
11:54 dustin: but that only happens when the socket layer figures out that something is wrong
11:55 dustin: a read() can hang forever waiting for an incoming packet from a slave, and if that packet never arrives, no error is generated (on the side doing the read())
12:55 armenzg: so Windows so far?
12:56 armenzg: dustin: does that mean that Windows' socket layer is more fragile?
12:57 dustin: yes, or more specifically has different timeouts, behaviors, etc.
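
One mitigation for the "read() can hang forever" case, independent of buildbot's own slave keepalive option, is to enable TCP keepalives on the master/slave sockets so the kernel eventually surfaces an error for a silently dead peer. The sketch below only illustrates the raw socket idea, using Linux-specific tuning knobs; Windows tunes the equivalent via SIO_KEEPALIVE_VALS:

import socket

def enable_keepalive(sock, idle=60, interval=10, probes=6):
    # Probe an idle connection so a dead peer eventually produces an error
    # instead of a read() that blocks forever.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # TCP_KEEPIDLE/KEEPINTVL/KEEPCNT are Linux-specific; guard for other OSes.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)

# With idle=60, interval=10, probes=6, a dead peer is noticed after roughly
# 60 + 6*10 = 120 seconds instead of never. Hypothetical usage:
#   sock = socket.create_connection((master_host, 9001))
#   enable_keepalive(sock)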
It took me a bit to find a Windows machine I could stick the SNMP agent on, as all of our Windows ninjas are out; however, I found an XP VM and enabled SNMP.

0ogre:~% snmpwalk -v2c -c public 198.18.1.48 tcpRetransSegs.0
TCP-MIB::tcpRetransSegs.0 = Counter32: 0

The same method works for Linux. There are other, more useful methods, like collectd plugins, but SNMP should be universal.  Just take note: the only way the counter stays at 0 is if the machine is completely idle. TCP retransmissions are common, and there are a great many reasons network traffic may need to be retransmitted. Small increments are normal; large ones (>1000 in a minute) would be worth looking into, but it depends on the environment.
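
To turn that rule of thumb into monitoring, something like the sketch below could watch the counter and flag big jumps. It reads /proc/net/snmp directly, so it is Linux-only; on Windows you would poll tcpRetransSegs over SNMP as shown above, or parse netstat -s. The >1000/minute threshold is just the heuristic from this comment:

#!/usr/bin/env python
# Watch the TCP retransmit counter and warn on large per-minute jumps.
# Linux-only: parses /proc/net/snmp. The threshold is a rough heuristic.
import time

def read_retrans_segs():
    # /proc/net/snmp has two "Tcp:" lines: a header row and a counter row.
    with open("/proc/net/snmp") as f:
        tcp_lines = [line.split() for line in f if line.startswith("Tcp:")]
    header, values = tcp_lines[0], tcp_lines[1]
    return int(values[header.index("RetransSegs")])

THRESHOLD = 1000  # retransmitted segments per minute

previous = read_retrans_segs()
while True:
    time.sleep(60)
    current = read_retrans_segs()
    delta, previous = current - previous, current
    status = "WARN" if delta > THRESHOLD else "ok"
    print("%s retransmits in last minute: %d [%s]"
          % (time.strftime("%H:%M:%S"), delta, status))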
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard