Closed
Bug 1060416
Opened 10 years ago
Closed 10 years ago
packet loss between usw2 and scl3
Categories
(Infrastructure & Operations Graveyard :: NetOps, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: catlee, Assigned: dcurado)
References
Details
Attachments
(2 files)
Over the past week or so we've started seeing worse network performance to/from usw2. Trees are currently closed for this.

http://netops2.private.scl3.mozilla.com/smokeping/sm.cgi?target=Datacenters.RELENG-SCL3.nagios1-releng-usw2

e.g. from buildbot-master115.srv.releng.usw2.mozilla.com to runtime-binaries.pvt.build.mozilla.org I get this from mtr:

    Host                               Loss%   Snt  Last   Avg  Best  Wrst  StDev
 1. 169.254.249.1                       0.0%    23   0.7   0.6   0.4   0.8    0.1
 2. 169.254.249.25                      0.0%    23   0.6   3.9   0.4  47.5   10.2
 3. 169.254.249.26                     13.0%    23  36.8  38.4  35.5  41.5    1.6
 4. v-1033.border1.scl3.mozilla.net    21.7%    23  36.7  38.6  36.6  43.1    1.7
 5. v-1029.fw1.scl3.mozilla.net        13.0%    23  38.0  38.9  36.3  40.5    1.0
 6. releng-zlb.dmz.scl3.mozilla.com    21.7%    23  40.5  38.8  36.0  40.5    1.4

See also https://bugzilla.mozilla.org/show_bug.cgi?id=975438
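As a rough way to triage reports like this, the mtr output above can be checked programmatically. A minimal sketch, assuming the hop data copied from the report in this bug; the 5% threshold is an illustrative assumption, not a NetOps policy:

```python
# Hedged sketch: flag hops in an mtr report whose loss exceeds a threshold.
# Hop data is copied from the mtr output in this bug; the threshold is an
# assumption chosen for illustration.
LOSS_THRESHOLD = 5.0  # percent

hops = [
    ("169.254.249.1", 0.0),
    ("169.254.249.25", 0.0),
    ("169.254.249.26", 13.0),
    ("v-1033.border1.scl3.mozilla.net", 21.7),
    ("v-1029.fw1.scl3.mozilla.net", 13.0),
    ("releng-zlb.dmz.scl3.mozilla.com", 21.7),
]

def lossy_hops(hops, threshold=LOSS_THRESHOLD):
    """Return (host, loss%) pairs whose loss exceeds the threshold."""
    return [(host, loss) for host, loss in hops if loss > threshold]

for host, loss in lossy_hops(hops):
    print(f"{host}: {loss}% loss")
```

Note that loss appearing only from hop 3 onward (the tunnel head-end) is what points at the IPsec path rather than the local network.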
Updated•10 years ago
Assignee: network-operations → dcurado
Comment 2•10 years ago
Had a chat with webops, everything appears to be fine on the Zeus side (not reaching any caps etc).
Comment 3•10 years ago
Comment 4•10 years ago
I'm still unable to reproduce this lossiness. In the mtr above, the loss is occurring on the IPsec tunnel from AWS -> Mozilla. Is the amount of traffic we're sending from AWS back towards SCL3 increasing?

i.e. This doesn't look like a circuit that has errors on it, because the problem comes and goes. And we aren't anywhere near saturating the Internet access that the IPsec tunnels ride over. That leaves me wondering if we're just sending more traffic over the IPsec tunnel and running into loss due to CPU overload on the tunnel endpoint.
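The CPU-overload theory above predicts that loss should rise with traffic level rather than stay constant. A minimal sketch of that check, with invented sample counters (the traffic/packet numbers are hypothetical, for illustration only):

```python
# Hedged sketch: compute loss from sent/received counters sampled at
# different traffic levels. If loss climbs with load, that is consistent
# with CPU overload at the tunnel endpoint; flat loss would suggest a
# circuit error instead. Sample values below are invented.
def loss_pct(sent, received):
    """Packet loss as a percentage of packets sent."""
    if sent == 0:
        return 0.0
    return 100.0 * (sent - received) / sent

# (traffic_mbps, packets_sent, packets_received) -- hypothetical samples
samples = [(100, 1000, 1000), (300, 1000, 990), (600, 1000, 870)]
for mbps, sent, recv in samples:
    print(f"{mbps} Mb/s: {loss_pct(sent, recv):.1f}% loss")
```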
Comment 6•10 years ago
If anything it's been decreasing, see https://observium.private.scl3.mozilla.com/device/device=32/tab=port/port=3226/. The CPU graph linked on that page doesn't show upward changes in the last week or so either.
Assignee | ||
Comment 7•10 years ago
Here's a follow-up on this issue:

- We were seeing loss on packets going from AWS to Mozilla's SCL3.
- There are two IPsec tunnels between those two endpoints, one active, one non-active.
- Apparently there has been some luck in "flipping the tunnels" when this sort of traffic loss has happened in the past.
- The releng folks affected by this problem suggested I try flipping the tunnels, and once I understood what that meant, I was prepared to do it.
- However, doing so would reset a bunch of TCP sessions, which would cause some problems for releng.
- They asked that I wait, and allow them to move some of their ~1100 VMs to another AWS site.
- After they had moved between 200 and 300 VMs to another AWS site, the traffic loss issue disappeared.
- So the problem appears load related.
- However, we (Mozilla) have not been increasing traffic between AWS and SCL3.
- So something changed that leaves the connection unable to handle the same load it had been handling.
- My guess is that the hardware on the Amazon side of the link -- a router of some kind -- is now doing more work than it used to. It is a shared resource, shared by N number of AWS customers. If Amazon loaded more customer traffic onto that device, it would produce the load-affected behavior we are seeing. (That said, the above statement is just a guess on my part.)

If this bug should not be closed, please re-open. Thanks!
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Comment 8•10 years ago
Good luck getting them to admit it's their fault. It took me 3 months at my real job to convince AT&T that packet loss on one of our ISP circuits to them was on their end, and to finally get it fixed.
Comment 9•10 years ago
Vendors always insist the problem is on the customer side.
Updated•2 years ago
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard