Closed
Bug 1123911
Opened 9 years ago
Closed 9 years ago
fw1.releng.scl3.mozilla.net routing failures - BGP use1
Categories
(Infrastructure & Operations Graveyard :: CIDuty, task)
Tracking
(Not tracked)
RESOLVED
INCOMPLETE
People
(Reporter: jlund, Assigned: dcurado)
Details
Attachments
(1 file)
59.58 KB,
image/png
nagios-releng> Tue 13:13:56 PST [4056] fw1.private.releng.scl3.mozilla.net:BGP use1 vpn-c149afa8-2 is CRITICAL: SNMP CRITICAL - BGP sess vpn-c149afa8-2 (usw1/169.254.255.77) uptime *111* secs (http://m.mozilla.org/BGP+use1+vpn-c149afa8-2)

smokeping: http://netops2.private.scl3.mozilla.com/smokeping/sm.cgi?target=Datacenters.RELENG-SCL3.nagios1-releng-use1

use1 has been flapping periodically throughout the day. It is unclear whether the problem is on our (scl3) end or the AWS end. No AWS ticket has been opened yet.
Reporter | ||
Comment 1•9 years ago
netops investigated[1]. Since this traffic path is in-house <-> internet <-> AWS, there are many possible points of failure, and some of them lie outside both our end and Amazon's. Note: we just flapped[2] again; it's unclear whether this affected any infra jobs.

dcurado proposed that if this continues we could try forcing the traffic out a different link and opening an AWS ticket. I think both should be done, and I will follow up tomorrow AM unless we have 0 loss overnight.

action items:
- "open AWS ticket" -- rail: is this something I can do myself?
- "try forcing the traffic out a different link" -- dcurado: seeing as we just flapped again, this sounds like something we will need to investigate doing. How much prep time do you need if we want to pull the trigger, and who should I talk to tomorrow AM PT?

[1] 15:59:19 <•dcurado> So far it looks better. BGP sessions have been up for about 2.5 hours, and pings from the firewall to the ipsec endpoints in AWS do not show loss
[2] 17:56:01 <nagios-releng> Tue 17:56:01 PST [4075] fw1.private.releng.scl3.mozilla.net:BGP use1 vpn-c149afa8-2 is CRITICAL
Flags: needinfo?(rail)
Flags: needinfo?(dcurado)
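For context on how these alerts fire: a Nagios BGP uptime check of this kind treats a small "established time" as evidence that the session just flapped. The sketch below is not the actual Mozilla plugin; the thresholds are assumptions inferred from the alerts in this bug (uptimes of *111*/*23*/*50* secs were CRITICAL, *350* secs was WARNING, 651 secs was OK).

```python
# Hypothetical thresholds -- inferred from the alert log, not from the
# real check's configuration.
CRIT_UPTIME_SECS = 300   # assumed: session re-established very recently
WARN_UPTIME_SECS = 600   # assumed: session still "young" after a flap

def classify_bgp_uptime(uptime_secs: int) -> str:
    """Map a BGP peer's SNMP established-time (seconds) to a Nagios state."""
    if uptime_secs < CRIT_UPTIME_SECS:
        return "CRITICAL"
    if uptime_secs < WARN_UPTIME_SECS:
        return "WARNING"
    return "OK"

# The uptimes seen in this bug's alerts:
for secs in (111, 23, 50, 350, 651):
    print(secs, classify_bgp_uptime(secs))
```

On these assumed thresholds, the check stays noisy for ten minutes after each flap, which matches the burst of CRITICAL/WARNING alerts in the log.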
Comment 2•9 years ago
Opened case 1321006051 with AWS.
Reporter | ||
Updated•9 years ago
Flags: needinfo?(rail)
Reporter | ||
Comment 3•9 years ago
I'm going to close this now since 1) we are no longer hitting this issue at an abnormal rate, and 2) AWS said there was nothing they could do in case 1321006051.
Status: NEW → RESOLVED
Closed: 9 years ago
Flags: needinfo?(dcurado)
Resolution: --- → WORKSFORME
Reporter | ||
Comment 4•9 years ago
flapped again today @ 14:13 PT and fully recovered @ 14:33; we lost a number of running jobs. Re-opening this for now with a needinfo to myself, to close if we don't see it again.

fallout:

14:13:53 <nagios-releng> Wed 14:13:52 PST [4932] fw1.private.releng.scl3.mozilla.net:BGP use1 vpn-c149afa8-2 is CRITICAL: SNMP CRITICAL - BGP sess vpn-c149afa8-2 (use1/169.254.255.77) uptime *23* secs (http://m.mozilla.org/BGP+use1+vpn-c149afa8-2)
14:15:33 <nagios-releng> Wed 14:15:32 PST [4935] buildbot-master01.bb.releng.use1.mozilla.com:ntp time is CRITICAL: CHECK_NRPE: Socket timeout after 15 seconds. (http://m.mozilla.org/ntp+time)
14:17:14 <arr> that's not good
14:18:53 <nagios-releng> Wed 14:18:52 PST [4936] fw1.private.releng.scl3.mozilla.net:BGP use1 vpn-c149afa8-2 is CRITICAL: SNMP CRITICAL - BGP sess vpn-c149afa8-2 (use1/169.254.255.77) uptime *50* secs (http://m.mozilla.org/BGP+use1+vpn-c149afa8-2)
14:19:01 <arr> jlund|buildduty: I bet those to puppet errors for bm03 and bm76 are due to that ^^
14:19:57 <jlund|buildduty> oh, fun!
14:20:03 <•coop|mtg> :(
14:20:33 <nagios-releng> Wed 14:20:32 PST [4939] buildbot-master01.bb.releng.use1.mozilla.com:ntp time is CRITICAL: CHECK_NRPE: Socket timeout after 15 seconds. (http://m.mozilla.org/ntp+time)
14:20:48 — jlund|buildduty checks bm01
14:22:30 <arr> jlund|buildduty: it's probably the use1 BGP thing for that, too
14:22:51 <jlund|buildduty> yes. likely
14:23:07 <jlund|buildduty> jobs still running. all green still for recent builds
14:23:23 <nagios-releng> Wed 14:23:23 PST [4940] buildbot-master02.bb.releng.use1.mozilla.com:MySQL Connectivity is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. (http://m.mozilla.org/MySQL+Connectivity)
14:23:33 <nagios-releng> Wed 14:23:33 PST [4941] buildbot-master113.bb.releng.use1.mozilla.com:MySQL Connectivity is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. (http://m.mozilla.org/MySQL+Connectivity)
14:23:53 <nagios-releng> Wed 14:23:53 PST [4942] fw1.private.releng.scl3.mozilla.net:BGP use1 vpn-c149afa8-2 is WARNING: SNMP WARNING - BGP sess vpn-c149afa8-2 (use1/169.254.255.77) uptime *350* secs (http://m.mozilla.org/BGP+use1+vpn-c149afa8-2)
14:24:05 <jlund|buildduty> and there goes mysql
14:24:20 — jlund|buildduty checks smokepings
14:24:22 <nagios-releng> Wed 14:24:22 PST [4945] buildbot-master117.bb.releng.use1.mozilla.com:MySQL Connectivity is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. (http://m.mozilla.org/MySQL+Connectivity)
14:24:33 <nagios-releng> Wed 14:24:33 PST [4946] buildbot-master51.bb.releng.use1.mozilla.com:MySQL Connectivity is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. (http://m.mozilla.org/MySQL+Connectivity)
14:24:36 <jlund|buildduty> Callek: KWierso|sheriffduty ^ put your seatbelts on
14:24:57 <jlund|buildduty> use1 is all over the place
14:25:31 <•Callek> wooo-hooo
14:25:33 <nagios-releng> Wed 14:25:33 PST [4947] buildbot-master01.bb.releng.use1.mozilla.com:ntp time is CRITICAL: CHECK_NRPE: Socket timeout after 15 seconds. (http://m.mozilla.org/ntp+time)
14:25:53 <KWierso|sheriffduty> jlund|buildduty: closure-worthy?
14:27:12 <jlund|buildduty> I don't think jobs will start pending dramatically but spurious results should be expected
14:27:21 <jlund|buildduty> I bet we will have bustage fallout of jobs
14:28:53 <nagios-releng> Wed 14:28:52 PST [4948] fw1.private.releng.scl3.mozilla.net:BGP use1 vpn-c149afa8-2 is OK: SNMP OK - BGP sess vpn-c149afa8-2 (use1/169.254.255.77) uptime 651 secs (http://m.mozilla.org/BGP+use1+vpn-c149afa8-2)
14:30:13 <nagios-releng> Wed 14:30:12 PST [4951] buildbot-master01.bb.releng.use1.mozilla.com:ntp time is OK: NTP OK: Offset 0.001215457916 secs (http://m.mozilla.org/ntp+time)
14:30:39 <arr> jlund|buildduty: was talking to dcurrado on #sysadmins, and he says that the connection looks stable now

recovery:

14:33:13 <nagios-releng> Wed 14:33:12 PST [4952] buildbot-master02.bb.releng.use1.mozilla.com:MySQL Connectivity is OK: Uptime: 1757106 Threads: 361 Questions: 724978049 Slow queries: 67078 Opens: 72831 Flush tables: 2 Open tables: 2350 Queries per second avg: 412.597 (http://m.mozilla.org/MySQL+Connectivity)
14:33:23 <nagios-releng> Wed 14:33:22 PST [4953] buildbot-master113.bb.releng.use1.mozilla.com:MySQL Connectivity is OK: Uptime: 1757117 Threads: 371 Questions: 724982864 Slow queries: 67080 Opens: 72831 Flush tables: 2 Open tables: 2350 Queries per second avg: 412.597 (http://m.mozilla.org/MySQL+Connectivity)
14:33:36 <jlund|buildduty> here we go
14:33:42 <jlund|buildduty> arr: I'll file to get it documented
14:34:13 <nagios-releng> Wed 14:34:12 PST [4954] buildbot-master117.bb.releng.use1.mozilla.com:MySQL Connectivity is OK: Uptime: 1757170 Threads: 359 Questions: 725014439 Slow queries: 67082 Opens: 72831 Flush tables: 2 Open tables: 2350 Queries per second avg: 412.603 (http://m.mozilla.org/MySQL+Connectivity)
14:34:23 <nagios-releng> Wed 14:34:22 PST [4955] buildbot-master51.bb.releng.use1.mozilla.com:MySQL Connectivity is OK: Uptime: 1757179 Threads: 362 Questions: 725019117 Slow queries: 67083 Opens: 7283
Status: RESOLVED → REOPENED
Flags: needinfo?(jlund)
Resolution: WORKSFORME → ---
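For anyone correlating fallout like the log above, the nagios-releng alert lines follow a regular shape (event id in brackets, then host:service, then state). A small sketch of pulling those fields out, assuming that format; the field names are my own, not from any Mozilla tooling:

```python
import re

# Assumed alert format, based on the lines quoted in this bug:
#   ... [4932] fw1.private.releng.scl3.mozilla.net:BGP use1 ... is CRITICAL: ...
ALERT_RE = re.compile(
    r"\[(?P<event_id>\d+)\]\s+"         # nagios event number, e.g. [4932]
    r"(?P<host>\S+?):"                   # host name, up to the first colon
    r"(?P<service>.+?) is "              # service description (may contain spaces)
    r"(?P<state>OK|WARNING|CRITICAL)"    # nagios state
)

def parse_alert(line: str):
    """Return the alert fields as a dict, or None if the line doesn't match."""
    m = ALERT_RE.search(line)
    return m.groupdict() if m else None

line = ("14:13:53 <nagios-releng> Wed 14:13:52 PST [4932] "
        "fw1.private.releng.scl3.mozilla.net:BGP use1 vpn-c149afa8-2 is "
        "CRITICAL: SNMP CRITICAL - BGP sess vpn-c149afa8-2 "
        "(use1/169.254.255.77) uptime *23* secs")
print(parse_alert(line))
```

Grouping parsed alerts by host and state makes it easy to see, for example, that every MySQL Connectivity CRITICAL in the burst coincided with the BGP session being down.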
Reporter | ||
Comment 5•9 years ago
Flags: needinfo?(jlund)
Reporter | ||
Updated•9 years ago
Flags: needinfo?(jlund)
Reporter | ||
Updated•9 years ago
Status: REOPENED → RESOLVED
Closed: 9 years ago → 9 years ago
Resolution: --- → FIXED
Reporter | ||
Updated•9 years ago
Flags: needinfo?(jlund)
Assignee | ||
Comment 6•9 years ago
This is an ongoing problem, which could be addressed by establishing a direct connection with AWS. I am currently assembling information for a cost/benefit analysis: how much a direct connection would cost us per month, versus how much it sets us back when releng processes fail due to problems on the Internet causing loss on the ipsec tunnels. I would like to leave this bug open, as the situation is unresolved. Thanks.
Assignee: nobody → dcurado
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Assignee | ||
Comment 7•9 years ago
I am closing this bug, but there is a pointer to it in https://bugzilla.mozilla.org/show_bug.cgi?id=962679, which I will be using to track the issues/pain caused by using VPN access to reach AWS, as well as to track costs for Direct Connect.
Status: REOPENED → RESOLVED
Closed: 9 years ago → 9 years ago
Resolution: --- → INCOMPLETE
Updated•6 years ago
Product: Release Engineering → Infrastructure & Operations
Updated•4 years ago
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard