Closed Bug 1429546 Opened 6 years ago Closed 6 years ago

[ops infra socorro] loadtest webapp in new infra

Categories

(Socorro :: Infra, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: willkg, Assigned: willkg)

The new infrastructure is different in a few ways from the old infrastructure. We should loadtest the webapp nodes in the new infrastructure.

This bug covers writing up a loadtesting plan for the webapp nodes that we'll use for -stage and -prod.
Grumpy: Is this something you want to take on? We're thinking we'll need this in February, and you'll probably get more familiar with Socorro out of it.
Flags: needinfo?(chartjes)
Yes, would be glad to help out with this effort in February.
Flags: needinfo?(chartjes)
QA Contact: chartjes
As mentioned in an IRC conversation with willkg, I need some time with someone familiar with the system to determine which endpoints need to be hit for a load test.

Who would be the appropriate person for this project?
Flags: needinfo?(willkg)
Definitely worth looking at the e2e-tests and basing the load test on that.
Flags: needinfo?(willkg)
I traded emails with Chris just now. I'm going to take this on and try to get it done this week.
Assignee: nobody → willkg
Status: NEW → ASSIGNED
I threw together a test plan based on what Chris started:

https://docs.google.com/document/d/1d-WqjrzMhjwMSzr_TFoyA4gYy7KF0p0Qld7wR0pDLQI/edit#

I fleshed out the socorro-load-tests code:

https://github.com/willkg/socorro-load-tests

I ran a test-the-load-test test and, following that, a 1-hour test against -new-stage.

Copying from my email to socorro-dev:

"""
Last night, I picked up where Chris left off and threw together a
rough load test plan for the webapp. Then today, I took the code that
Chris had started, and fleshed it out to a point where I could run a
test-the-load-test test and a rough load test to get a feel for what
things looked like.

Webapp load test plan is here:

https://docs.google.com/document/d/1d-WqjrzMhjwMSzr_TFoyA4gYy7KF0p0Qld7wR0pDLQI/edit#

During the test-the-load-test test, I determined that I could probably
get enough req/s from my laptop to run the load test from there. Then
I did a 1-hour load test running from my laptop against the
-new-stage webapp nodes.

Short summary:

1. the webapp in -new-stage exceeds the 1x (3 req/s) and 3x (9 req/s)
targets--peak of 14 req/s
2. Datadog graphs suggest the webapp cluster is scaling nicely--scaled
at 10m and 20m for a total of 4 nodes
3. there weren't any errors in Sentry or non-200 responses

The only concern is that 0.9% of the requests timed out. The load test
code sets a timeout of 10 seconds, which isn't very long. It's not
clear why these requests timed out (connection? waiting for response?
ES garbage collection?).

More details on the 1 hour from my laptop load test:

https://docs.google.com/document/d/1d-WqjrzMhjwMSzr_TFoyA4gYy7KF0p0Qld7wR0pDLQI/edit#heading=h.2swl8ar501gi

My thoughts at this stage:

I think there's enough evidence to suggest we're fine and that we
don't need to do anything further.


Does that make sense to you? What's important to pursue further?
"""

I think this is good enough, but will iterate further if there are interesting things we should pursue. I'll keep the bug open until there's consensus we should be done.
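
As a quick check on the numbers in the email above, the peak rate can be expressed as a multiple of the 1x baseline. This is only a minimal sketch: the constants are taken from the email, while the function names `load_multiple` and `meets_target` are made up for illustration and are not part of socorro-load-tests.

```python
BASELINE_RPS = 3.0  # the "1x" target from the email (3 req/s)
PEAK_RPS = 14.0     # peak observed during the 1-hour run against -new-stage


def load_multiple(observed_rps, baseline_rps=BASELINE_RPS):
    """Express an observed request rate as a multiple of the baseline."""
    return observed_rps / baseline_rps


def meets_target(observed_rps, multiple, baseline_rps=BASELINE_RPS):
    """True if the observed rate reaches `multiple` times the baseline."""
    return observed_rps >= multiple * baseline_rps


print(round(load_multiple(PEAK_RPS), 2))  # 4.67
print(meets_target(PEAK_RPS, 1), meets_target(PEAK_RPS, 3))  # True True
```

In other words, the observed peak is a bit over 4.5x the baseline, which is why both the 1x and 3x targets count as met.
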
Mike and Miles raised an eyebrow at the FAILURES/Timeouts.

I did another iteration of load testing focusing on those and comparing -new-stage to -stage. The timeouts are definitely timeouts, though it's not clear what specifically is timing out. Both environments show timeouts during a loadtest, but -stage is significantly worse than -new-stage.

I'm not entirely sure how to differentiate between the various types of timeouts. aiohttp raises TimeoutError() with no explanation. To figure out more, I think I'd have to switch tools. It's a mystery, but I think we're ok with leaving it mysterious.
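
One possible way to tell the timeout types apart is to put a separate time limit on each phase of a request. The following is a rough sketch that uses the standard library's asyncio streams instead of aiohttp. The name `timed_get`, its parameters, and the 3-second connect limit are assumptions for illustration (only the 10-second limit comes from the load test), and this is not what socorro-load-tests actually does.

```python
import asyncio
import time


async def timed_get(host, port, path="/", use_tls=False,
                    connect_timeout=3.0, read_timeout=10.0):
    """Issue one GET and report which phase, if any, timed out.

    Returns (outcome, seconds), where outcome is "ok",
    "connect-timeout", or "read-timeout".
    """
    start = time.monotonic()
    try:
        # Connect phase (includes the TLS handshake when use_tls is True).
        reader, writer = await asyncio.wait_for(
            asyncio.open_connection(host, port, ssl=use_tls),
            connect_timeout,
        )
    except asyncio.TimeoutError:
        return "connect-timeout", time.monotonic() - start

    try:
        request = f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n"
        writer.write(request.encode("ascii"))
        await writer.drain()
        # HTTP/1.0 without keep-alive: the server closes the socket when
        # it is done, so read() returns once the full response arrived.
        await asyncio.wait_for(reader.read(), read_timeout)
        return "ok", time.monotonic() - start
    except asyncio.TimeoutError:
        return "read-timeout", time.monotonic() - start
    finally:
        writer.close()
```

Counting the outcomes over a run would at least show whether the failures cluster in connection setup or in waiting for the response. It would not by itself explain a slow response (for example, ES garbage collection).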

Everything else about the load test looks fine. Given that, I'm going to mark this FIXED.
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED