Closed Bug 1060416 Opened 10 years ago Closed 10 years ago

packet loss between usw2 and scl3

Categories

(Infrastructure & Operations Graveyard :: NetOps, task)

Platform: x86_64 Linux
Type: task
Priority: Not set
Severity: blocker

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: catlee, Assigned: dcurado)

References

Details

Attachments

(2 files)

Over the past week or so we've started seeing worse network performance to/from usw2.

Trees are currently closed for this.

http://netops2.private.scl3.mozilla.com/smokeping/sm.cgi?target=Datacenters.RELENG-SCL3.nagios1-releng-usw2

e.g. from buildbot-master115.srv.releng.usw2.mozilla.com to runtime-binaries.pvt.build.mozilla.org I get this from mtr:

 Host                               Loss%   Snt   Last   Avg  Best  Wrst StDev
 1. 169.254.249.1                    0.0%    23    0.7   0.6   0.4   0.8   0.1
 2. 169.254.249.25                   0.0%    23    0.6   3.9   0.4  47.5  10.2
 3. 169.254.249.26                  13.0%    23   36.8  38.4  35.5  41.5   1.6
 4. v-1033.border1.scl3.mozilla.net 21.7%    23   36.7  38.6  36.6  43.1   1.7
 5. v-1029.fw1.scl3.mozilla.net     13.0%    23   38.0  38.9  36.3  40.5   1.0
 6. releng-zlb.dmz.scl3.mozilla.com 21.7%    23   40.5  38.8  36.0  40.5   1.4
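
For the record, that table looks like mtr's report mode; a run along these lines (host taken from this bug, cycle count matching the 23 samples above) should reproduce it:

  mtr --report --report-cycles 23 runtime-binaries.pvt.build.mozilla.org

Note that loss shown only at intermediate hops can just be ICMP rate-limiting on those routers; loss that carries through to later hops, as it does here, is the number to trust.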


See also https://bugzilla.mozilla.org/show_bug.cgi?id=975438
Assignee: network-operations → dcurado
looking at this
Status: NEW → ASSIGNED
Had a chat with webops, everything appears to be fine on the Zeus side (not hitting any caps, etc.).
I'm still unable to reproduce this lossiness.

In the mtr above, the loss is occurring on the IPSec tunnel from AWS -> Mozilla.

Is the amount of traffic we're sending from AWS back towards SCL3 increasing?
i.e. This doesn't look like a circuit that has errors on it, because the
problem comes and goes.  And we aren't anywhere near saturating the
Internet access that the IPSec tunnels ride over.  That leaves me
wondering if we're just sending more traffic over the IPSec tunnel and
running into loss due to CPU overload.
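
One quick way to check the traffic question, assuming SNMP read access to the tunnel endpoint (the hostname, community string, and ifIndex below are placeholders, not values from this bug):

  # sample the tunnel interface's 64-bit input byte counter a minute apart;
  # the delta is the offered load in bytes per minute
  snmpget -v2c -c COMMUNITY vpn1.example.net IF-MIB::ifHCInOctets.IFINDEX
  sleep 60
  snmpget -v2c -c COMMUNITY vpn1.example.net IF-MIB::ifHCInOctets.IFINDEX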
If anything it's been decreasing, see https://observium.private.scl3.mozilla.com/device/device=32/tab=port/port=3226/. The CPU graph linked on that page doesn't show upward changes in the last week or so either.
Here's a follow up on this issue:
 - we were seeing loss from the packets going from AWS to Mozilla's SCL3.
 - There are two ipsec tunnels between those two end points, one active, one non-active.
 - Apparently there has been some luck in "flipping the tunnels" when this sort of traffic
   loss has happened in the past.
 - The releng folks affected by this problem suggested I try flipping the tunnels, and once
   I understood what that meant, I was prepared to do it.
 - However, doing so would reset a bunch of TCP sessions, which would cause some problems
   for releng.
 - They asked that I wait, and allow them to move some of their ~1100 VMs to another AWS site.
 - After they had moved between 200 and 300 VMs to another AWS site, the traffic loss issue
   disappeared (a monitoring sketch to confirm it stays gone follows this list).
 - So the problem appears to be load related.
 - However, we (Mozilla) have not been increasing traffic between AWS and SCL3.
 - So something changed that results in the connection being unable to handle the same
   load that it has been handling.  
 - My guess is that the hardware on the Amazon side of the link -- a router of some kind -- is
   now doing more work than it used to.  It is a shared resource, shared by some number of AWS
   customers.  If Amazon loaded more customer traffic onto that device, it would result in
   the load-dependent behavior we are seeing.
   (That said, the above is just a guess on my part.)
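
A hedged sketch for confirming the loss stays gone after the VM moves (log path and interval are arbitrary choices, not from this bug):

  # log a 50-cycle mtr report against the same target every 5 minutes;
  # grep the log's Loss% column later to spot any recurrence
  while true; do
    date
    mtr --report --report-cycles 50 runtime-binaries.pvt.build.mozilla.org
    sleep 300
  done >> /tmp/usw2-scl3-loss.log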

If this bug should not be closed, please re-open.  Thanks!
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Good luck getting them to admit it is their fault.  It took me 3 months at my real job to convince AT&T that packet loss on one of our ISP circuits to them was on their end, and to finally get it fixed.
Vendors always insist the problem is on the customer side.
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard