Closed Bug 1123911 Opened 9 years ago Closed 9 years ago

fw1.releng.scl3.mozilla.net routing failures - BGP use1

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

Platform: x86 macOS
Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED INCOMPLETE

People

(Reporter: jlund, Assigned: dcurado)

Details

Attachments

(1 file)

nagios-releng> Tue 13:13:56 PST [4056] fw1.private.releng.scl3.mozilla.net:BGP use1 vpn-c149afa8-2 is CRITICAL: SNMP CRITICAL - BGP sess vpn-c149afa8-2 (usw1/169.254.255.77) uptime *111* secs (http://m.mozilla.org/BGP+use1+vpn-c149afa8-2)

smokeping: http://netops2.private.scl3.mozilla.com/smokeping/sm.cgi?target=Datacenters.RELENG-SCL3.nagios1-releng-use1

use1 has been flapping periodically throughout the day. It's unclear whether the problem is on our (scl3) end or the AWS end; no AWS ticket has been opened yet.
netops investigated[1]. Since this path is in-house <-> internet <-> AWS, there are many potential points of failure, and some of them could be outside both our end and Amazon's.

Note: we just flapped[2] again; it's unclear whether this affected any infra jobs.

dcurado proposed that if this continues we could try forcing the traffic out a different link and opening an AWS ticket. I think both should be done, and I will follow up tomorrow AM unless we have 0 loss overnight.

action items:
   - "open AWS ticket" rail: is this something I can do myself?
   - "try forcing the traffic out a different link" dcurado: seeing we just flapped again, this sounds like something we will need to investigate doing. How much prep time do you need if we want to pull the trigger and who should I talk to tomorrow AM PT?

[1] 15:59:19 <•dcurado> So far it looks better.  BGP sessions have been up for about 2.5 hours, and pings from the firewall to the ipsec endpoints in AWS do not show loss
[2] 17:56:01 <nagios-releng> Tue 17:56:01 PST [4075] fw1.private.releng.scl3.mozilla.net:BGP use1 vpn-c149afa8-2 is CRITICAL
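
(aside: the nagios check quoted above is just polling the BGP peer's established time over SNMP, so the "uptime *111* secs" figure is how long the session has been up since its last reset. Below is a minimal sketch of an equivalent manual query; it assumes the firewall exposes the standard BGP4-MIB over SNMP v2c with a read-only community of "public", both of which are assumptions for illustration and not taken from fw1's actual config.)

# Minimal sketch: read bgpPeerFsmEstablishedTime (BGP4-MIB) for the use1
# tunnel peer, the same counter the nagios SNMP check above alerts on.
# Assumptions (not verified against fw1's config): SNMP v2c is reachable,
# the read-only community is "public", and the peer is indexed by its
# remote address in the standard bgpPeerTable.
from pysnmp.hlapi import (
    SnmpEngine, CommunityData, UdpTransportTarget, ContextData,
    ObjectType, ObjectIdentity, getCmd,
)

FIREWALL = "fw1.private.releng.scl3.mozilla.net"
PEER_IP = "169.254.255.77"  # ipsec endpoint for vpn-c149afa8-2
OID = "1.3.6.1.2.1.15.3.1.16." + PEER_IP  # BGP4-MIB::bgpPeerFsmEstablishedTime

error_indication, error_status, error_index, var_binds = next(
    getCmd(
        SnmpEngine(),
        CommunityData("public"),  # assumed community string
        UdpTransportTarget((FIREWALL, 161), timeout=5, retries=1),
        ContextData(),
        ObjectType(ObjectIdentity(OID)),
    )
)

if error_indication or error_status:
    print("SNMP query failed:", error_indication or error_status.prettyPrint())
else:
    uptime_secs = int(var_binds[0][1])
    # A value that keeps resetting to something small (like the *111* secs in
    # the alert above) is the flap signature nagios keys on.
    print("BGP session with %s established for %d secs" % (PEER_IP, uptime_secs))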
Flags: needinfo?(rail)
Flags: needinfo?(dcurado)
Opened case 1321006051 with AWS.
Flags: needinfo?(rail)
I'm going to close this now since 1) we are no longer hitting this issue at an abnormal rate and 2) AWS said there was nothing they could do in case 1321006051.
Status: NEW → RESOLVED
Closed: 9 years ago
Flags: needinfo?(dcurado)
Resolution: --- → WORKSFORME
flapped again today @ 14:13 PT and fully recovered @ 14:33

lost a number of running jobs. re-opening this for now with a needinfo to myself to close if we don't see it again


fallout:
14:13:53 <nagios-releng> Wed 14:13:52 PST [4932] fw1.private.releng.scl3.mozilla.net:BGP use1 vpn-c149afa8-2 is CRITICAL: SNMP CRITICAL - BGP sess vpn-c149afa8-2 (use1/169.254.255.77) uptime *23* secs (http://m.mozilla.org/BGP+use1+vpn-c149afa8-2)
14:15:33 <nagios-releng> Wed 14:15:32 PST [4935] buildbot-master01.bb.releng.use1.mozilla.com:ntp time is CRITICAL: CHECK_NRPE: Socket timeout after 15 seconds. (http://m.mozilla.org/ntp+time)
14:17:14 <arr> that's not good
14:18:53 <nagios-releng> Wed 14:18:52 PST [4936] fw1.private.releng.scl3.mozilla.net:BGP use1 vpn-c149afa8-2 is CRITICAL: SNMP CRITICAL - BGP sess vpn-c149afa8-2 (use1/169.254.255.77) uptime *50* secs (http://m.mozilla.org/BGP+use1+vpn-c149afa8-2)
14:19:01 <arr> jlund|buildduty: I bet those two puppet errors for bm03 and bm76 are due to that ^^
14:19:57 <jlund|buildduty> oh, fun!
14:20:03 <•coop|mtg> :(
14:20:33 <nagios-releng> Wed 14:20:32 PST [4939] buildbot-master01.bb.releng.use1.mozilla.com:ntp time is CRITICAL: CHECK_NRPE: Socket timeout after 15 seconds. (http://m.mozilla.org/ntp+time)
14:20:48 — jlund|buildduty checks bm01
14:22:30 <arr> jlund|buildduty: it's probably the use1 BGP thing for that, too
14:22:51 <jlund|buildduty> yes. likely
14:23:07 <jlund|buildduty> jobs still running. all green still for recent builds
14:23:23 <nagios-releng> Wed 14:23:23 PST [4940] buildbot-master02.bb.releng.use1.mozilla.com:MySQL Connectivity is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. (http://m.mozilla.org/MySQL+Connectivity)
14:23:33 <nagios-releng> Wed 14:23:33 PST [4941] buildbot-master113.bb.releng.use1.mozilla.com:MySQL Connectivity is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. (http://m.mozilla.org/MySQL+Connectivity)
14:23:53 <nagios-releng> Wed 14:23:53 PST [4942] fw1.private.releng.scl3.mozilla.net:BGP use1 vpn-c149afa8-2 is WARNING: SNMP WARNING - BGP sess vpn-c149afa8-2 (use1/169.254.255.77) uptime *350* secs (http://m.mozilla.org/BGP+use1+vpn-c149afa8-2)
14:24:05 <jlund|buildduty> and there goes mysql
14:24:20 — jlund|buildduty checks smokepings
14:24:22 <nagios-releng> Wed 14:24:22 PST [4945] buildbot-master117.bb.releng.use1.mozilla.com:MySQL Connectivity is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. (http://m.mozilla.org/MySQL+Connectivity)
14:24:33 <nagios-releng> Wed 14:24:33 PST [4946] buildbot-master51.bb.releng.use1.mozilla.com:MySQL Connectivity is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. (http://m.mozilla.org/MySQL+Connectivity)
14:24:36 <jlund|buildduty> Callek: KWierso|sheriffduty ^ put your seatbelts on
14:24:57 <jlund|buildduty> use1 is all over the place
14:25:31 <•Callek> wooo-hooo
14:25:33 <nagios-releng> Wed 14:25:33 PST [4947] buildbot-master01.bb.releng.use1.mozilla.com:ntp time is CRITICAL: CHECK_NRPE: Socket timeout after 15 seconds. (http://m.mozilla.org/ntp+time)
14:25:53 <KWierso|sheriffduty> jlund|buildduty: closure-worthy?
14:27:12 <jlund|buildduty> I don't think jobs will start pending dramatically but spurious results should be expected
14:27:21 <jlund|buildduty> I bet we will have bustage fallout of jobs
14:28:53 <nagios-releng> Wed 14:28:52 PST [4948] fw1.private.releng.scl3.mozilla.net:BGP use1 vpn-c149afa8-2 is OK: SNMP OK - BGP sess vpn-c149afa8-2 (use1/169.254.255.77) uptime 651 secs (http://m.mozilla.org/BGP+use1+vpn-c149afa8-2)
14:30:13 <nagios-releng> Wed 14:30:12 PST [4951] buildbot-master01.bb.releng.use1.mozilla.com:ntp time is OK: NTP OK: Offset 0.001215457916 secs (http://m.mozilla.org/ntp+time)
14:30:39 <arr> jlund|buildduty: was talking to dcurrado on #sysadmins, and he says that the connection looks stable now


recovery
14:33:13 <nagios-releng> Wed 14:33:12 PST [4952] buildbot-master02.bb.releng.use1.mozilla.com:MySQL Connectivity is OK: Uptime: 1757106 Threads: 361 Questions: 724978049 Slow queries: 67078 Opens: 72831 Flush tables: 2 Open tables: 2350 Queries per second avg: 412.597 (http://m.mozilla.org/MySQL+Connectivity)
14:33:23 <nagios-releng> Wed 14:33:22 PST [4953] buildbot-master113.bb.releng.use1.mozilla.com:MySQL Connectivity is OK: Uptime: 1757117 Threads: 371 Questions: 724982864 Slow queries: 67080 Opens: 72831 Flush tables: 2 Open tables: 2350 Queries per second avg: 412.597 (http://m.mozilla.org/MySQL+Connectivity)
14:33:36 <jlund|buildduty> here we go
14:33:42 <jlund|buildduty> arr: I'll file to get it documented
14:34:13 <nagios-releng> Wed 14:34:12 PST [4954] buildbot-master117.bb.releng.use1.mozilla.com:MySQL Connectivity is OK: Uptime: 1757170 Threads: 359 Questions: 725014439 Slow queries: 67082 Opens: 72831 Flush tables: 2 Open tables: 2350 Queries per second avg: 412.603 (http://m.mozilla.org/MySQL+Connectivity)
14:34:23 <nagios-releng> Wed 14:34:22 PST [4955] buildbot-master51.bb.releng.use1.mozilla.com:MySQL Connectivity is OK: Uptime: 1757179 Threads: 362 Questions: 725019117 Slow queries: 67083 Opens: 7283
Status: RESOLVED → REOPENED
Flags: needinfo?(jlund)
Resolution: WORKSFORME → ---
Flags: needinfo?(jlund)
Flags: needinfo?(jlund)
Status: REOPENED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Flags: needinfo?(jlund)
This is an ongoing problem, which could be addressed by establishing a direct connection with AWS.
I am currently assembling information for a cost/benefit analysis: how much a direct connection
would cost us per month, versus how much it sets us back when releng processes fail because
problems on the Internet cause loss on the IPsec tunnels.
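
As a back-of-the-envelope illustration of that comparison (every figure below is a placeholder chosen only to show the arithmetic, not an actual quote or a measured cost):

# Rough cost/benefit sketch: Direct Connect vs. living with tunnel flaps.
# All numbers are hypothetical placeholders; the real analysis needs actual
# AWS Direct Connect pricing and releng's measured cost per incident.
direct_connect_monthly = 1500.0   # port + cross-connect + data transfer, USD/month (placeholder)

flaps_per_month = 6               # incident rate (placeholder)
jobs_lost_per_flap = 20           # builds/tests killed or retried per flap (placeholder)
cost_per_lost_job = 15.0          # compute + retrigger overhead per job, USD (placeholder)
triage_hours_per_flap = 1.5       # buildduty/sheriff time per flap (placeholder)
hourly_rate = 75.0                # loaded hourly cost, USD (placeholder)

flap_cost_monthly = flaps_per_month * (
    jobs_lost_per_flap * cost_per_lost_job + triage_hours_per_flap * hourly_rate
)

print("Direct Connect:   $%.2f / month" % direct_connect_monthly)
print("Tunnel flapping:  $%.2f / month" % flap_cost_monthly)
if flap_cost_monthly > direct_connect_monthly:
    print("On these numbers, Direct Connect pays for itself.")
else:
    print("On these numbers, the VPN is still cheaper.")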

I would like to leave this bug open as this is an unresolved situation.
Thanks.
Assignee: nobody → dcurado
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Am closing this bug, but have a pointer to it in 
https://bugzilla.mozilla.org/show_bug.cgi?id=962679

which I will be using to track the issues/pain caused by using VPN access to get to AWS, as well
as to track costs for Direct Connect.
Status: REOPENED → RESOLVED
Closed: 9 years ago
Resolution: --- → INCOMPLETE
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard