Closed Bug 1438390 Opened 6 years ago Closed 6 years ago

[ops infra socorro] verify production monitoring/alerting for -new-prod

Categories

(Socorro :: Infra, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: miles, Assigned: brian)

References

Details

I've configured production monitoring and alerting for Socorro.

Once we cut over, we'll assume the same domain names used by the existing infra, so the Pingdom monitoring that is already in place is sufficient (a sketch of this kind of check follows the list):
  - crash-reports.mozilla.com
  - crash-reports.mozilla.com/__heartbeat__
  - crash-stats.mozilla.com
  - crash-stats.mozilla.com/monitoring/healthcheck/
  - crash-stats.mozilla.com/monitoring/crontabber/
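
For reference, here's a minimal sketch of what these checks amount to (Python with requests; the "healthy means HTTP 200" assumption and the standalone-script form are mine, not Pingdom's actual configuration):

```python
# Rough equivalent of the uptime checks above: fetch each endpoint and treat
# anything other than HTTP 200 as a failure (assumption; the real Pingdom
# checks may be configured differently).
import requests

ENDPOINTS = [
    "https://crash-reports.mozilla.com/__heartbeat__",
    "https://crash-stats.mozilla.com/monitoring/healthcheck/",
    "https://crash-stats.mozilla.com/monitoring/crontabber/",
]

def is_up(url, timeout=10):
    try:
        return requests.get(url, timeout=timeout).status_code == 200
    except requests.RequestException:
        return False

if __name__ == "__main__":
    for url in ENDPOINTS:
        print(url, "OK" if is_up(url) else "FAILING")
```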

I've created a service in PagerDuty (currently set to low priority so that it only sends email alerts).

I've created a set of monitors in Datadog: https://app.datadoghq.com/monitors/manage?q=socorro -new-prod&

Those monitors are all configured to alert the PagerDuty service, and I have tested that.

This bug covers verifying that the monitoring is sufficient as configured and discussing potential additional monitoring.
Here is what I see in Datadog:

# our code #

## collector (antenna)
* large s3 save queue
* low request count
* cpu usage high (email)
* disk space low
* memory usage high
* network in high

## lambda (pigeon)
* low invocation rate
* high error rate

## processor
* cpu usage high
* low save_raw_and_processed
* disk space low
* memory usage high

## crontabber
* disk space low
* memory usage high

## webapp
* disk space low
* memory usage high

# our datastores #

## elasticsearch
* cluster red
* memory usage high
* jvm heap usage high
* heap usage is high
* low ELB request count

# managed services #

## rabbitmq
* n/a

## RDS
* high cpu usage
* low free space

Here are a few ideas for changes we could discuss.

* Create forecast alerts (https://www.datadoghq.com/blog/forecasts-datadog/) for running out of disk space on RDS and EC2 instances and have them email us a week in advance. Running out of disk is usually predictable well in advance. I'd probably kill the existing disk space alert for everything except RDS, since everything else can tolerate some instances running out of space.
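
A sketch of what one such forecast monitor could look like with datadogpy, assuming the documented forecast() monitor query syntax; the metric, tag scope, threshold, and notification address are placeholders, not the actual -new-prod definitions:

```python
# Sketch only: create a forecast monitor that warns a week before disk fills.
# The tag scope (app:socorro-new-prod), threshold, and email are placeholders.
from datadog import initialize, api

initialize(api_key="...", app_key="...")  # real keys come from the environment

api.Monitor.create(
    type="query alert",
    # Alert if disk usage is forecast to exceed 80% at any point in the next week.
    query=(
        "max(next_1w):forecast(avg:system.disk.in_use{app:socorro-new-prod} "
        "by {host}, 'linear', 1) >= 0.8"
    ),
    name="[socorro -new-prod] disk forecast to fill within a week",
    message="Disk is forecast to fill within a week. @socorro-ops@example.com",
    tags=["socorro", "new-prod"],
)
```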

Get alerted further in advance when we're nearing capacity on RDS and ES, which are a bit trickier to scale, by (see the sketch after this list):

* lower ES and RDS cpu thresholds to 70%
* add iops alert for rds (aws.rds.diskio.tps) and elasticsearch (system.io.r_s + system.io.w_s) based on what volume size and type should provide
* add network bandwidth alerts for rds (aws.rds.network_receive_throughput, aws.rds.network_transmit_throughput) and elasticsearch (system.net.bytes_rcvd, system.net.bytes_sent) based on what instance type should provide
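
A sketch of what those capacity monitors could look like, using the metric names listed above; the thresholds and tag scopes are placeholders that would be derived from the provisioned volume size/type and instance types:

```python
# Sketch only: thresholds and tag scopes are placeholders, to be replaced with
# values based on what the provisioned volumes and instance types should provide.
from datadog import initialize, api

initialize(api_key="...", app_key="...")

CAPACITY_MONITORS = [
    ("rds iops high",
     "avg(last_15m):avg:aws.rds.diskio.tps{dbinstanceidentifier:socorro-new-prod} > 900"),
    ("elasticsearch iops high",
     "avg(last_15m):avg:system.io.r_s{role:elasticsearch} + avg:system.io.w_s{role:elasticsearch} > 900"),
    ("rds network throughput high (bytes/sec)",
     "avg(last_15m):avg:aws.rds.network_receive_throughput{dbinstanceidentifier:socorro-new-prod} + "
     "avg:aws.rds.network_transmit_throughput{dbinstanceidentifier:socorro-new-prod} > 100000000"),
]

for name, query in CAPACITY_MONITORS:
    api.Monitor.create(
        type="metric alert",
        query=query,
        name=f"[socorro -new-prod] {name}",
        message="Nearing provisioned capacity. @socorro-ops@example.com",
    )
```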

* Make the ES cluster health monitor alert us on sustained yellow instead of red. The cluster shouldn't remain yellow for more than a few minutes, so this would tell us we need to intervene to bring it back to a fully healthy state. Once it's red, we've already lost data, and we can't write to the affected index(es) until we've sorted out whatever caused the cluster to go red and restored the affected index(es) from backup.
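
A standalone sketch of the "sustained yellow" idea against Elasticsearch's _cluster/health API; the cluster URL, poll interval, and five-minute window are placeholders, and in practice this would be a Datadog monitor with a time window rather than a script:

```python
# Sketch only: poll cluster health and flag sustained yellow, or any red.
import time
import requests

ES_URL = "http://localhost:9200"   # placeholder cluster address
SUSTAINED_FOR = 5 * 60             # treat more than 5 minutes of yellow as actionable
POLL_EVERY = 30

yellow_since = None
while True:
    status = requests.get(f"{ES_URL}/_cluster/health", timeout=10).json()["status"]
    now = time.monotonic()
    if status == "green":
        yellow_since = None
    else:
        yellow_since = yellow_since or now
        if status == "red" or now - yellow_since > SUSTAINED_FOR:
            print(f"cluster is {status}: intervene before (more) data is lost")
            break
    time.sleep(POLL_EVERY)
```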

* Does the crontabber report metrics for job success or failure? It could be nice to get notified if jobs start failing.
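
If crontabber doesn't already emit per-job metrics, here's a sketch of what that could look like with dogstatsd; the metric names and tags are hypothetical, not existing Socorro metrics, and a Datadog monitor on the failure count could then send the email:

```python
# Hypothetical per-job metrics (crontabber.job.success/failure are made-up
# names) emitted via dogstatsd so a monitor can notify when jobs start failing.
from datadog import statsd

def run_job(job):
    """Run a crontabber-style job object exposing .name and .run()."""
    try:
        job.run()
        statsd.increment("crontabber.job.success", tags=[f"job:{job.name}"])
    except Exception:
        statsd.increment("crontabber.job.failure", tags=[f"job:{job.name}"])
        raise
```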

* I would make almost everything an email alert except for the alerts related to the work the system is doing, e.g. is the collector saving data, is pigeon queuing it for processing, is the processor processing it, is the webapp serving requests? Those I would have page. I think you've measured them well via:

* collector: large s3 save queue
* collector: low request count
* collector: High ELB 5xx and backend 5xx
* pigeon: low invocation rate
* pigeon: high error rate
* processor: low save_raw_and_processed
* webapp: High ELB 5xx and backend 5xx

Pingdom will help handle the "is the webapp serving requests" bit since the workload is too variable for a low request count alert.

The only other alerts I would have page are

* elasticsearch: cluster health
* rds: disk space low

These reflect impending problems that are difficult to recover from.
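
A sketch of how that email-vs-page split could be expressed when building the monitor messages, using Datadog's notification-handle convention; the PagerDuty handle and email address are placeholders for whatever the real handles are:

```python
# Sketch only: route the "work the system is doing" alerts (plus ES cluster
# health and RDS disk space) to PagerDuty, everything else to email.
PAGING_ALERTS = {
    "collector: large s3 save queue",
    "collector: low request count",
    "collector: high ELB 5xx and backend 5xx",
    "pigeon: low invocation rate",
    "pigeon: high error rate",
    "processor: low save_raw_and_processed",
    "webapp: high ELB 5xx and backend 5xx",
    "elasticsearch: cluster health",
    "rds: disk space low",
}

def notification_handle(alert_name):
    """Return the handle to append to a monitor message so it routes correctly."""
    if alert_name in PAGING_ALERTS:
        return "@pagerduty-Socorro-New-Prod"   # placeholder PagerDuty service handle
    return "@socorro-ops@example.com"          # placeholder email handle
```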

* I'm guessing you have the separate paging CPU alert on processors to let us know that we've scaled processors to the max allowed by their ASG and still have enough work to keep them busy. I don't think that alone reflects a problem worth paging for. We need to know whether the scaled-up processors have started to process crashes faster than we're receiving them. I think we could distinguish this with a composite alert (https://docs.datadoghq.com/monitors/monitor_types/composite/) over "processor cpu is high" *and* "the derivative of the queue length in CloudAMQP is positive" (a sketch follows below).
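
A sketch of that composite alert with datadogpy, assuming the documented composite monitor form (component monitor IDs joined with &&) and that derivative() is usable in the queue-growth query; the metric scopes, thresholds, queue tag, and notification handle are all placeholders:

```python
# Sketch only: page when processor CPU is pegged *and* the queue keeps growing.
from datadog import initialize, api

initialize(api_key="...", app_key="...")

# Component 1: processor CPU sustained high (placeholder query and threshold).
cpu_high = api.Monitor.create(
    type="metric alert",
    query="avg(last_15m):avg:system.cpu.user{role:processor} > 90",
    name="[socorro -new-prod] processor cpu high (composite component)",
    message="Component monitor; the composite handles notification.",
)

# Component 2: queue depth still growing (placeholder queue tag).
queue_growing = api.Monitor.create(
    type="metric alert",
    query="avg(last_15m):derivative(avg:rabbitmq.queue.messages{queue:socorro.normal}) > 0",
    name="[socorro -new-prod] processing queue growing (composite component)",
    message="Component monitor; the composite handles notification.",
)

# Composite: both components must be alerting before anyone gets paged.
api.Monitor.create(
    type="composite",
    query=f"{cpu_high['id']} && {queue_growing['id']}",
    name="[socorro -new-prod] processors saturated and falling behind",
    message="Processors are maxed out and not keeping up. @pagerduty-Socorro-New-Prod",
)
```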
This has been done for a while; I just forgot to resolve it.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED