Closed Bug 1130242 Opened 9 years ago Closed 9 years ago

request for throughput data on the SCL3 ZLBs for the past 12 hours

Categories

(Infrastructure & Operations Graveyard :: WebOps: Product Delivery, task)

RESOLVED INCOMPLETE

People

(Reporter: dcurado, Assigned: cliang)

Details

(Whiteboard: [kanban:https://webops.kanbanize.com/ctrl_board/2/550] )

Hi,
this evening releng had trouble downloading images from ftp.mozilla.org
and ftp-ssl.mozilla.org.  They got in touch with netops as they thought
perhaps it was a network problem.

But they were seeing problems from releng slaves in SCL3 as well as slaves
in AWS.  

I checked all network links within the data center and could not see
a problem.  

This got me wondering about the ZLBs, as I know we have a license limitation.
Can we get the throughput data?
Thanks very much.
I don't know that we have throughput data in the format that you want.

I usually end up looking at the ZLB graphs:
 * current activity: 
       https://zlb1.ops.scl3.mozilla.com:9090/apps/zxtm/index.fcgi?section=Monitoring
 * historical activity:
       https://zlb1.ops.scl3.mozilla.com:9090/apps/zxtm/index.fcgi?section=Statd

The current graph gives you some idea of whether we're running into the license limitation right now.  The historical data doesn't give as clean an overview.  That graph is drawn from statd logs; if you have something that can crunch through those, they can be found at zlb1.ops.scl3.mozilla.com:/usr/local/zeus/zxtm/log/statd/


Otherwise, the load balancers are supposed to write out to the logs if the bandwidth limit is hit.  I went to each of the load balancers serving ftp.mozilla.org traffic and searched for "bandwidth" or "license" in the errors log (kept in /usr/local/zeus/zxtm/log/).  I couldn't see any evidence that it had been triggered.  (I did see some adjustments made to the ftp protection class on Dec. 4th.)
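
For anyone repeating that search, here is a minimal Python sketch of it. It assumes the errors log is plain text with one event per line; the function names and the case-insensitive matching are hypothetical choices for illustration, not part of any ZLB tooling.

```python
import re
from typing import Iterable, List

# The two search terms come from the comment above; case-insensitive
# matching is an assumption made for this sketch.
KEYWORDS = re.compile(r"bandwidth|license", re.IGNORECASE)


def find_limit_lines(lines: Iterable[str]) -> List[str]:
    """Return every line that mentions either keyword, without its newline."""
    return [line.rstrip("\n") for line in lines if KEYWORDS.search(line)]


def scan_file(path: str) -> List[str]:
    """Scan one log file; undecodable bytes are replaced rather than raising."""
    with open(path, errors="replace") as handle:
        return find_limit_lines(handle)
```

Calling scan_file() on a copy of the errors log from each load balancer would list any candidate lines; an empty result matches the "no evidence" finding above.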
Thanks for getting back to me about this.

Maybe we can take a step back on my request...

It does not happen often, but now and then there is a lot of traffic hitting the ZLBs, 
and we may run into our license cap.  People experience this as a "network problem",
and come to the netops team.  It would be quite helpful if we could look at the 
ZLB throughput and say, "ah ha, we're hitting our license cap"  
Possible?
Thanks again.
Right now, the only way to catch this is to look at the current activity graph and do the maths and/or see if the event log has been triggered.
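
"Doing the maths" could look roughly like the Python sketch below. It assumes you can read two samples of an outbound byte counter a known number of seconds apart. The 2 Gbps default is taken from the figure quoted in the IRC paste later in this bug and is treated here as an assumed cap; all names are hypothetical.

```python
# Assumed cap, based on the "2Gbps out" figure quoted later in this bug.
ASSUMED_CAP_GBPS = 2.0


def throughput_gbps(bytes_start: int, bytes_end: int, seconds: float) -> float:
    """Convert a byte-counter delta over an interval into gigabits per second."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    bits = (bytes_end - bytes_start) * 8
    return bits / seconds / 1e9


def cap_fraction(gbps: float, cap_gbps: float = ASSUMED_CAP_GBPS) -> float:
    """Return the share of the cap in use (1.0 means the cap is reached)."""
    return gbps / cap_gbps
```

For example, 15,000,000,000 bytes sent in 60 seconds works out to 2.0 Gbps, which is a cap_fraction of 1.0 against the assumed cap.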

I experimented briefly with having the ZLBs send email when a bandwidth event got triggered, but the email never came through.  I'll try to resume testing next week (since it is a No Change Friday (tm)).  If I'm successful, we can look at the best way of looping in NetOps / MOC.
Assignee: server-ops-webops → cliang
I've done some tweaking and I think the load balancers can now be configured to alert if certain traffic license limits are reached.  If you go to one of the load balancers and hit "System" -> "Alerting" -> "Manage Actions", you can see what alerting options are available.  Feel free to add an alerting method that works for you / netops; if you have more questions, etc. about trying to implement an alert, ask away. =)

The event "group" <Traffic License Limit Problem> includes the following events:
  * tpslimited: License key transactions-per-second limit has been hit
  * ssltpslimited: License key SSL transactions-per-second limit has been hit
  * bwlimited: License key bandwidth limit has been hit

Right now, the only alert that is sent out is an email to me (the alert <E-mail C>).
Could I ask you for a favor: can you add netops@mozilla.com to that email alert?
See Also: → 1136195
Set up a separate alert called <Notify Netops>.  The <Traffic License Limit Problem> "group" should now trigger the <Notify Netops> alert in addition to <E-mail C>.
Closing this, as we'll now get email when we're hitting the ZLB throughput thresholds.
Thanks!
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
thanks! this is amazing.
<cyliang>	nthomas, dcurado: Ugh. so, yes, confirmation that the log event isn't getting generated when we hit 2Gbps out.
<cyliang>	nthomas, dcurado: will look at seeing if we can trigger this from somewhere else (maybe graphite)
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
If we can get the alert piped into #buildduty in IRC in addition to wherever it needs to go for IT, it will head off a lot of digging when things are timing out in CI.
See Also: → 1140489
Closing this bug (which is a request for throughput logs) in favor of bug 1140489, which is explicitly about creating a Nagios check for bandwidth.
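
As a rough idea of the threshold logic such a check might use, here is a Python sketch. The exit codes follow the standard Nagios plugin convention (0 OK, 1 WARNING, 2 CRITICAL); the threshold values, the function name, and the message format are hypothetical and not taken from bug 1140489.

```python
from typing import Tuple

# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL = 0, 1, 2


def check_bandwidth(
    gbps: float, warn_gbps: float = 1.6, crit_gbps: float = 1.9
) -> Tuple[int, str]:
    """Map an outbound throughput reading to a Nagios status and message.

    The default thresholds are placeholders chosen for illustration only.
    """
    if gbps >= crit_gbps:
        return CRITICAL, f"CRITICAL - outbound {gbps:.2f} Gbps"
    if gbps >= warn_gbps:
        return WARNING, f"WARNING - outbound {gbps:.2f} Gbps"
    return OK, f"OK - outbound {gbps:.2f} Gbps"
```

A wrapper script would print the message and exit with the returned code; where the throughput reading comes from (graphite or elsewhere) is left open here.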
Status: REOPENED → RESOLVED
Closed: 9 years ago
Resolution: --- → INCOMPLETE
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard