Closed Bug 1795017 Opened 2 years ago Closed 5 months ago

processor lags causing processing queue build up

Categories

(Socorro :: Processor, defect, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: willkg, Assigned: willkg)

References

(Depends on 1 open bug)

Details

Attachments

(2 files)

There have been several incidents where the processors lag, processing throughput drops, and the processing queue builds up until it hits the threshold for the Socorro Visible SQS Messages alert and pages people.

  • September 13th at 19:50 UTC
  • September 27th at 7:30 UTC
  • September 28th at 13:45 UTC

This covers figuring that out.

Assignee: nobody → willkg
Status: NEW → ASSIGNED

Harold, Christina, and I had multiple conversations, and I threw those observations and some other notes into a Google doc:

https://docs.google.com/document/d/1Pw0KR2sqwe5R9EoHOI1ABtcbQOcZGeNbsCiUHKJBH9g/edit#

I wrote up a bunch of data in that document with graphs and figures.

DSRE jira ticket: https://mozilla-hub.atlassian.net/browse/DSRE-1066

We decided to change the average CPU threshold that triggers scaling up from 75% to 60%. I made that change in the cloudops-deployment repo.

PR: https://github.com/mozilla-services/cloudops-deployment/pull/4719

This was pushed to prod just now in bug #1797179.

The number of processors went down to 2 now that the deploy wiped out the manual stopgap. I'll keep an eye on this over the next few days and see if we have any problems with scaling.

We got alerted again today with a queue that eventually got to 15k and almost 2 hours of lag between collection and end of processing. There were four processors and when the queue started backing up, they didn't scale up.

Harold wants to change the scaling trigger to key off the queue size. That makes sense--right now we're using average CPU as a proxy for "there's a lot of work going on" and that proxy is all out of whack. It's a chunk of work, though; a rough sketch of what a queue-based policy could look like is below.

We're continuing to use a stopgap of raising the minimum number of processors.
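
We haven't built this, but for illustration, a queue-based policy could look roughly like the boto3 sketch below. It assumes we'd first publish a "backlog per instance" metric to CloudWatch; the ASG name, metric namespace/name, and target value are all made up:

```python
# Hedged sketch of queue-depth-based target tracking for the processor ASG.
# Assumes a separate job publishes visible-SQS-messages-per-instance to
# CloudWatch; every name below is hypothetical.
import boto3

autoscaling = boto3.client("autoscaling")

autoscaling.put_scaling_policy(
    AutoScalingGroupName="socorro-processor",     # hypothetical ASG name
    PolicyName="queue-backlog-target-tracking",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "CustomizedMetricSpecification": {
            "MetricName": "BacklogPerInstance",   # hypothetical custom metric
            "Namespace": "Socorro/Processor",     # hypothetical namespace
            "Statistic": "Average",
        },
        # Scale so each instance has roughly this many queued messages at most.
        "TargetValue": 500.0,
    },
)
```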

Bug #1698682 covers changing the processor app to use Honcho to run a processor plus a disk cache manager. Right now, the processor process spins off the disk cache manager, and if the disk cache manager dies, the instance eventually runs out of disk and then flails around (but doesn't die). With this change, if the disk cache manager dies, the processor process would die and then the instance would die. I'm not sold on Honcho, but I'm not seeing other process manager options.
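
For reference, here's a minimal sketch of the supervision behavior we want from something like Honcho: run both processes side by side and take everything down if either exits, so the instance health check fails and the node gets replaced. The entry points are placeholders, not Socorro's actual commands.

```python
# Minimal supervision sketch (not Honcho itself): run two child processes and
# exit non-zero as soon as either one dies.
import subprocess
import sys
import time

COMMANDS = [
    ["python", "-m", "processor.app"],            # hypothetical processor entry point
    ["python", "-m", "processor.cache_manager"],  # hypothetical disk cache manager
]


def main():
    procs = [subprocess.Popen(cmd) for cmd in COMMANDS]
    try:
        while True:
            for proc in procs:
                if proc.poll() is not None:
                    # One process died; propagate failure so the instance
                    # gets recycled instead of limping along.
                    return proc.returncode or 1
            time.sleep(1)
    finally:
        for proc in procs:
            if proc.poll() is None:
                proc.terminate()


if __name__ == "__main__":
    sys.exit(main())
```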

Bug #907277 covers changing the processor to be multi-process instead of multi-threaded. Multi-process is a better architecture for the processor, which is now spending a lot of time in I/O wait, and it would let the processor take advantage of multiple CPUs. A first step towards that is to use whatever we adopt for bug #1698682 to run multiple processor processes with a single thread each.

I'm planning something along these lines:

  1. continue the stopgap of having a large minimum number of processors for now
  2. in January 2023, implement some version of bug #1698682 and bug #907277 and then see where that puts us
  3. either change how scaling works to be based on the queue OR continue the stopgap until we switch to GCP
Depends on: 1698682, 907277

We're still having problems with the queue backing up.

I put up a PR to raise the minimum number of processors to 10 so we can stabilize before the end of December, giving us time to implement a real fix next year.

https://github.com/mozilla-services/cloudops-deployment/pull/4749

The processors scaled up, worked through the queue, and looked fine, and then they fizzled out: suddenly the stackwalker was taking a really long time to process minidumps and the average CPU dropped significantly.

Maybe we're hitting a network or EBS bandwidth limit or something like that?

Next step is to create a Grafana dashboard to look into infrastructure and runtime metrics.

I looked at metrics for a while and didn't see anything conclusive. After that, we decided to change how the processor works: switch it from multi-threaded (which has potential GIL contention, among other things) to multi-process.

We use Honcho to run multiple processes in other services, so we decided to use that here, too. Bug #1698682 covers switching to Honcho to run the processor.

However, Honcho runs processes in a way that drops all the configman environment variables (which aren't valid shell variables), so in order to switch to Honcho, we needed to switch from configman to Everett. That's covered in bug #1529342.
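
As a rough illustration of the Everett pattern (not Socorro's actual settings module), configuration comes from plain environment variables that Honcho passes through untouched, with defaults and type parsing declared in one place. The setting names here are made up:

```python
# Hedged Everett sketch: read settings from the environment with defaults and
# parsers; all keys below are hypothetical.
from everett.manager import ConfigManager

config = ConfigManager.basic_config()

queue_name = config("sqs_queue_name", default="socorro-standard")
worker_count = config("processor_worker_count", default="4", parser=int)
symbols_cache_dir = config("symbols_cache_dir", default="/tmp/symbols")
```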

In order to do that, we also needed to rewrite the disk cache manager. That's covered in bug #1673493.

Suddenly that's a major overhaul of Socorro to cover switching from configman to Everett, using Honcho to run processes, and rewriting the disk cache manager.

That's mostly done now and is sitting on stage. It covers all the work we need to do to switch Socorro from multi-threaded to multi-process, but doesn't include that final switch. Once that lands, we'll see where things are and whether the processor still has the performance characteristics that currently prevent it from scaling.

I pushed all the changes to production. I updated the incident report. I updated the production dashboards.

Next step is to remove the stopgap that sets the minimum number of processors to 10. Then we can see if the processor CPU usage numbers are in line with what we expect and what we can write a scaling rule for.

If that doesn't work, then I'll look into switching the processor from a multi-threaded service to a multi-process service.

We changed the minimum number of processors to 4 and I tried reprocessing 1,000 crash reports. That should not be a big deal. However, we saw exactly what we've been seeing with this bug:

  1. initially, each processor node's CPU usage_user goes up and processor throughput goes up, but then both slide down until throughput is lower than it was before the load started
  2. meanwhile, CPU io_wait climbs until it's the state the CPU spends most of its time in
  3. also, stackwalker timings go way up--probably because it's waiting while downloading sym files
  4. and processor throughput goes down significantly
  5. because the CPU is not high enough, it doesn't scale up and add more nodes
  6. the backlog increases ad infinitum

One of my working theories was that this was a problem with the Python GIL and contention between threads. However, I don't think that theory holds--GIL contention wouldn't manifest in the way we're seeing.

I started to look at network throttling again. The stackwalker makes HTTP requests to symbols.mozilla.org, which (if the sym file exists) returns a redirect to an AWS S3 bucket URL, and the stackwalker downloads the sym file from that bucket.

I looked at rate-limiting and throttling in symbols.mozilla.org. We have some burst-related rate-limiting in nginx, but nothing else. I set up a load test to see if I could trigger 429s or slowness and couldn't. I'm pretty sure it's not symbols.mozilla.org.
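
The load test was along these lines (this is a sketch rather than the exact script, and the sym path is only an example): hit symbols.mozilla.org concurrently, count status codes, and watch for 429s or slow responses; healthy responses should be redirects.

```python
# Hedged load-test sketch against symbols.mozilla.org.
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
import time

import requests

# Example sym file URL; a real test would use a mix of known-existing sym files.
URL = "https://symbols.mozilla.org/xul.pdb/0123456789ABCDEF0123456789ABCDEF1/xul.sym"


def fetch(_):
    start = time.monotonic()
    # allow_redirects=False so we only measure symbols.mozilla.org itself,
    # not the S3 download it redirects to.
    resp = requests.get(URL, allow_redirects=False, timeout=30)
    return resp.status_code, time.monotonic() - start


with ThreadPoolExecutor(max_workers=20) as pool:
    results = list(pool.map(fetch, range(500)))

statuses = Counter(status for status, _ in results)
slowest = max(elapsed for _, elapsed in results)
print(f"status codes: {dict(statuses)}; slowest response: {slowest:.2f}s")
```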

Are we getting rate-limited by AWS S3 somehow? Or is it network usage for Socorro in the prod environment or the nodes it's using? I'll look into those next for a bit. But I'm getting to the point where we should leave this as is, and then during the GCP migration see if we have the same problems in GCP and, if so, fix them there.

I'm not going to get to the rest of this anytime soon, so taking it out of my queue.

Assignee: willkg → nobody
Status: ASSIGNED → NEW

We're having another one of these right now.

I spent some time looking at infrastructure-related graphs and I built a new analysis dashboard:

https://earthangel-b40313e5.influxcloud.net/d/FMW5L0e4k/socorro-processor-analysis-2023-08-09?orgId=1

I have a new theory on what's going on. I think what we're seeing is this:

  1. processor load increases until it hits some point P
  2. at point P:
    1. stackwalker rule timings go up significantly (from 6s mean to 30s mean); it's taking longer to process minidumps
    2. thus the processor takes longer to process each crash report--throughput drops
    3. thus we cross the line where incoming crash reports outnumber the ones being processed, and the queue backs up

So what's going on after we hit point P?

  1. instance memory usage stays the same
  2. instance disk usage stays the same
  3. cpu user usage mean goes down from ~20% to ~15%
  4. cpu iowait mean goes from ~15% to ~50%
  5. diskio reads drop from ~35 to ~22 (nb: I don't know what the unit of measurement is)
  6. diskio read_time goes up from ~500 to ~1800 (nb: I don't know what the unit of measurement is)

CPU iowait goes up. But CPU iowait is really about disk io--not network io. We don't see a huge change in network in/out during this time. Earlier theories I had about how this is probably related to downloading data from S3 don't hold up.

The fact that the number of disk reads goes down while read time goes up feels related to sym files increasing dramatically in size in September 2022, when these problems started.

It sure feels like we've got too many stackwalkers running simultaneously for this instance type, and they're getting bottlenecked on disk reads. The new theory is that when the system is under load, it hits a disk-read bottleneck, stackwalkers sit waiting for data from disk and tie up all the processing threads, CPU usage drops because they're waiting, and the processor nodes don't scale up.

Back in March 2023, we dropped the number of threads from some number (5?) to 4. I think we should drop the number of threads to 3 and see what that does to throughput and other performance numbers.

We're seeing processor lag again today. Yesterday we saw more crash reports than we normally get. Here's the last two weeks:

date         --  Fenix  Firefox  Focus  MozillaVPN  ReferenceBrowser  Thunderbird   total
2023-07-27    0  31868    59680    787           0                 0        15926  108261
2023-07-28    0  32122    58124    800           1                 3        14494  105544
2023-07-29    0  33355    43744    871           0                 2         6503   84475
2023-07-30    0  33943    43406    821           1                 0         6146   84317
2023-07-31    0  33043    71756    783           0                 0        17161  122743
2023-08-01    0  32347    75260    715           1                 0        16333  124656
2023-08-02    0  31590    86896    734           0                 0        16399  135619
2023-08-03    0  30069    88872    769           3                 0        16427  136140
2023-08-04    0  28756    84441    682           0                 0        15102  128981
2023-08-05    0  29602    64257    603           0                 0         6679  101141
2023-08-06    0  29604    61566    629           0                 0         6518   98317
2023-08-07    0  28915    94919    564           0                 0        17240  141638
2023-08-08    0  28899    93426    592           0                 0        16126  139043
2023-08-09    0  46217    97458   1081           0                 0        17144  161900
2023-08-10    0  23785    48605    482           0                 0         8815   81687  (half day)

I'll look at adjusting the number of threads next week. I'll get SRE to add some additional processors for today and we can see how the numbers continue.

I wrote up a DSRE issue to temporarily add instances to the processor cluster. Mikael added 5, bringing the total up to 15 and processing throughput up to >200 per minute. The processing queue started going down. Then the cluster scaled back down to 10, terminating 5 of the old instances. Curiously, processing throughput remained at >200 per minute and there's no indication of increased iowait. In effect, recycling 5 instances returned the cluster to normal.

Maybe there are zombie processes around? Maybe there's thread/GIL contention in the processor process? Maybe there are deadlocks on cache files? Maybe something else is accumulating or running out over time?

In the meantime, I don't think it makes sense to drop the number of threads from 4 to 3. If we have a backup, it probably makes sense to do a prod deploy to get fresh instances.

We're still seeing an increased volume of crash reports, and the processing queue is backed up even with 20 instances. They're all sluggish. Further, we're seeing an increase in the number of stackwalker processes getting killed off because they exceed the 2-minute timeout.

My current theory is that the stackwalkers are bottlenecked on disk I/O and that reducing the number of stackwalkers (or getting a faster disk) will improve throughput. Even if it isn't disk I/O, it's probably some resource on the instance, and reducing the number of stackwalkers per instance will help.

We're in a change freeze, so the least complex thing we can do to get through the next two weeks is best.

I'm going to make that change now.

Despite outages with CircleCI and GitHub, I changed the number of threads (and thus the maximum number of concurrent stackwalkers) from 4 to 3 and pushed out a minor dependency update change to trigger a stage deploy.

I've been throwing load at stage and it's processing just fine. I'm looking at the cpu iowait, user, and system lines, the throughput, disk usage, and cache evictions and it's pretty steady. I don't know what stage was like before we dropped the threads from 4 to 3, so I don't have a baseline.

Comment #9 suggests we can reproduce the issue by reprocessing 1,000 crash reports in prod. I've reprocessed 2,200 crash reports on stage and it's been processing just fine. I think this change will be fine. I don't think it creates new problems. I think it may alleviate or fix our existing problem.

I'm going to push this to prod and keep an eye on it for the rest of the week.

See Also: → 1849708
See Also: → 1849352
See Also: → 1848831

Switching from 4 threads to 3 didn't fix the problem--we're still seeing processor instances go into a fugue state from which they never recover.

I was throwing ideas around with Lonnen and separately with Harold. Thoughts:

  • thread contention in the processor process (it's a multi-threaded Python application)?
  • zombie stackwalker processes accumulating over time?
  • hitting an open files limit?
  • some combination of the above

Can we (ab)use the disk cache manager or the main processor process to send health metrics and shed some light on the above?

Depends on: 1850428
Depends on: 1848002
Depends on: 1850750

I talked with Harold and we wondered if there are "things" that accumulate on the processor instances and cause them to slow down over time. Maybe zombie or orphaned processes.

I rewrote the code that runs the stackwalker as a subprocess so it involves one fewer child process and makes it less likely we orphan processes (bug #1850428).

One of the long-standing needs here is more metrics about what the stackwalker is doing. In bug #1848002, I added a "heartbeat" to the processor main thread that emits metrics on processes by status (running, sleeping, etc.), processes by type (shell, python, stackwalker, etc.), and the total number of open files.
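
For illustration, the heartbeat boils down to something like the psutil-based sketch below; the metric names and the gauge emitter are illustrative, not the exact implementation.

```python
# Heartbeat sketch: count processes by status and rough type, plus total open
# file descriptors, and emit gauges.
import collections

import psutil


def heartbeat(emit_gauge):
    by_status = collections.Counter()
    by_type = collections.Counter()
    open_files = 0

    for proc in psutil.process_iter(attrs=["name", "status"]):
        try:
            by_status[proc.info["status"]] += 1
            name = proc.info["name"] or ""
            if "stackwalk" in name:
                by_type["stackwalker"] += 1
            elif "python" in name:
                by_type["python"] += 1
            else:
                by_type["other"] += 1
            open_files += proc.num_fds()  # Unix-only
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            continue

    for status, count in by_status.items():
        emit_gauge(f"processes.by_status.{status}", count)
    for kind, count in by_type.items():
        emit_gauge(f"processes.by_type.{kind}", count)
    emit_gauge("open_files", open_files)


if __name__ == "__main__":
    heartbeat(lambda key, value: print(key, value))
```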

All that deployed to stage. Through reprocessing crash reports, I was able to get stage into the scenario we're having problems with where processing throughput drops off and never recovers.

https://earthangel-b40313e5.influxcloud.net/d/mTjlP_8Zz/socorro-stage-megamaster-remix?orgId=1&from=1693359450845&to=1693366730184

The processes-by-status graph shows that the number of processes in the running state drops and the number in the disk-sleep state goes up. The number of open files doesn't do anything weird. The total number of processes doesn't do anything weird, and there aren't zombie or orphaned processes showing up.

I talked with Gabriele and he wondered whether the stackwalker was getting into a fugue state where it was hitting some error saving a file to disk and got into a loop downloading it over and over again. There's no evidence that's the case--we would expect to see a rise in network traffic, but there isn't one. This seems unlikely.

I added disk read/write and network activity graphs to the processor analysis dashboard. When the processor enters a fugue state, the number of read iops drops, but the total time spent disk reading and writing goes up.

I wondered whether, over time, the disk cache becomes predominantly large xul sym files that are popular (Windows Firefox release channel, ESR, etc.), and having a bunch of them means the stackwalker spends more time reading them.

I changed the disk cache manager to have a heartbeat and emit metrics about the size of the files in the cache every second. That's covered in bug #1850750. I'm following up on that last theory now.
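
The disk cache manager heartbeat amounts to something like the sketch below; the cache path and metric names are made up for illustration.

```python
# Cache heartbeat sketch: once a second, walk the symbols cache directory and
# emit file-size metrics to see whether the cache trends toward a few huge
# (xul) sym files.
import os
import time


def cache_heartbeat(cache_dir, emit_gauge):
    sizes = []
    for root, _dirs, files in os.walk(cache_dir):
        for name in files:
            try:
                sizes.append(os.path.getsize(os.path.join(root, name)))
            except OSError:
                continue  # file may have been evicted mid-walk

    if sizes:
        emit_gauge("symbols_cache.file_count", len(sizes))
        emit_gauge("symbols_cache.total_bytes", sum(sizes))
        emit_gauge("symbols_cache.max_file_bytes", max(sizes))
        emit_gauge("symbols_cache.median_file_bytes", sorted(sizes)[len(sizes) // 2])


if __name__ == "__main__":
    while True:
        cache_heartbeat("/tmp/symbols", lambda key, value: print(key, value))
        time.sleep(1)
```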

I had a long talk with Relud about this today. He pointed out that when the processor instances go into a fugue state, the disk read and disk write IOPS drop to a constant number. Then we looked at the AWS console: this is an m5.xlarge, which does not have a local SSD and has EBS bandwidth listed as "Up to 4,750" Mbps.

So the theory is that since September 2022 the processor is doing a lot more disk I/O, this instance type uses a network-attached (EBS) disk, and it's hitting the upper limit of EBS bandwidth and getting throttled.

Relud suggested switching to an m5d.xlarge, which has a local disk. I'm going to make that change now and push it to stage. Then we need to see whether we have to format the disk and change other settings because of how it's mounted.
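
As a side note, a quick way to watch for that kind of throughput ceiling (not part of the actual fix) is to sample disk I/O counters and print per-second deltas; under throttling, bytes/sec and IOPS flatten out at a constant value while read_time keeps climbing.

```python
# Throttling-check sketch: print per-second disk I/O deltas on a node.
import time

import psutil

prev = psutil.disk_io_counters()
while True:
    time.sleep(1)
    cur = psutil.disk_io_counters()
    print(
        f"read {cur.read_bytes - prev.read_bytes:>12} B/s  "
        f"write {cur.write_bytes - prev.write_bytes:>12} B/s  "
        f"read iops {cur.read_count - prev.read_count:>6}  "
        f"read_time +{cur.read_time - prev.read_time} ms"
    )
    prev = cur
```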

Assignee: nobody → willkg
Status: NEW → ASSIGNED

We switched the instance type to an m5d.xlarge and changed the symbols cache to be on the local disk. I put a lot of load on stage yesterday and everything looked ok.

In December 2022, we reduced the number of processor worker threads from 5 to 4. Then in August, we dropped it from 4 to 3. Today I changed it from 3 to 5--so we're back to the original configuration. I'm going to put load on stage today to see how that goes.

If all goes well, we can deploy this to prod. I think this fixes all the performance issues we were having and that the CPU will get high enough when the system is under load to trigger the scaling policy. I'll test that out next week.

Stage scaled up and back down under load, so I changed the autoscaling values to roughly what we had back in December 2022. That's in stage now (although the values don't affect stage).

We pushed all the changes to prod yesterday in bug #1862722. Everything looks good. There were several times when the backlog grew from a sudden influx of reprocessing requests; the processor group scaled up and then back down once the backlog was clear. CPU usage looks better. Throughput looks better.

https://earthangel-b40313e5.influxcloud.net/d/FMW5L0e4k/socorro-processor-analysis-2023-08-09?orgId=1&from=1698950399867&to=1699035465035

Marking as FIXED.

Status: ASSIGNED → RESOLVED
Closed: 5 months ago
Resolution: --- → FIXED