Closed Bug 1429534 Opened 6 years ago Closed 6 years ago

[ops infra socorro] loadtest processor in new infra

Categories

(Socorro :: Infra, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: willkg, Unassigned)

We're putting together a new infrastructure. One of the things we want to verify is that the processor holds up under load and that its autoscaling rules work.

This bug covers figuring out a loadtest plan for the processor in the new infrastructure.
We can do this during the crash copy between the old crash bucket and the new crash bucket. Every crash copied by our S3DistCp [0] setup will trigger Pigeon, which will put the crash in the queue, and the processor will have to handle it.
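
A rough sketch of that flow, for illustration only. This is not Pigeon's actual code: the `v2/raw_crash/` key prefix, the function names `extract_crash_id` and `handle_s3_event`, and the in-memory queue standing in for the real crash queue are all assumptions. The only part taken from AWS is the shape of the S3 event notification (`Records[].s3.object.key`, with URL-encoded keys).

```python
import queue
from urllib.parse import unquote_plus

# Hypothetical key prefix for raw crashes; the real bucket layout may differ.
RAW_CRASH_PREFIX = "v2/raw_crash/"


def extract_crash_id(key):
    """Return the crash id from a raw crash key, or None for any other key."""
    if not key.startswith(RAW_CRASH_PREFIX):
        return None
    # Assumption: the crash id is the last path segment of the key.
    return key.rsplit("/", 1)[-1] or None


def handle_s3_event(event, crash_queue):
    """Put one crash id on the queue per raw crash object in an S3 event.

    Returns the number of crash ids queued.
    """
    queued = 0
    for record in event.get("Records", []):
        # Object keys in S3 event notifications are URL-encoded.
        key = unquote_plus(record["s3"]["object"]["key"])
        crash_id = extract_crash_id(key)
        if crash_id is not None:
            crash_queue.put(crash_id)
            queued += 1
    return queued
```

Under these assumptions, every object S3DistCp writes under the raw crash prefix produces exactly one queue entry, which is why a bulk copy doubles as a load test for the processor.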

I've also tested the autoscaling of the processor in -new-stage by allowing a queue to build up. The processor successfully scales up to work through the queue, and then scales back down.

[0] https://docs.aws.amazon.com/emr/latest/ReleaseGuide/UsingEMR_s3distcp.html (covers EMR but also general distcp)
This week, Miles did a few rounds of S3DistCp, which loadtested the shit out of the processor. Dumping files in S3 triggered Pigeon, which tossed crash ids into the queue with reckless abandon. We saw the socorro.normal queue spike to 1.5 million. The processor CPUs hit 100%, it autoscaled up to about 20 processors, and it worked through the queue.

Relevant Datadog timeline: https://app.datadoghq.com/dash/405076/socorro-new-stage?live=false&page=0&is_auto=false&from_ts=1520646583709&to_ts=1520983669331&tile_size=m

We didn't see any Sentry errors outside of normal processor issues.

After getting through the crash queue, I ran our socorro-compare scripts to compare ADI, product versions, raw crash, and processed crash data between -stage and -new-stage and everything looked super.

-new-stage processed 2.8 million crashes in a 3-day period. For comparison, a normal week for -prod is 1.5 million, and since the beginning of this year we've had a couple of spikes of 2.8 million over a 7-day period.
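
To put those numbers on the same scale, here is a quick back-of-the-envelope calculation. It assumes the load was spread evenly over each period, which it wasn't (the copy came in bursts), so treat the results as averages only. The `crashes_per_day` helper is just for illustration.

```python
def crashes_per_day(total_crashes, days):
    """Average daily crash volume over a period."""
    return total_crashes / days


loadtest = crashes_per_day(2_800_000, 3)     # -new-stage during the load test
normal_week = crashes_per_day(1_500_000, 7)  # a normal week for -prod
spike_week = crashes_per_day(2_800_000, 7)   # a spike week for -prod

print(f"load test:   {loadtest:,.0f} crashes/day")
print(f"normal week: {normal_week:,.0f} crashes/day")
print(f"spike week:  {spike_week:,.0f} crashes/day")
print(f"load test vs normal week: {loadtest / normal_week:.2f}x")
print(f"load test vs spike week:  {loadtest / spike_week:.2f}x")
```

On average, the load test ran at a bit over 4 times a normal -prod week and a bit over 2 times the spike weeks.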

I think we can call this a successful load test.

Miles: Anything you want to add?
Flags: needinfo?(miles)
We're good here.
Status: NEW → RESOLVED
Closed: 6 years ago
Flags: needinfo?(miles)
Resolution: --- → FIXED