Closed Bug 1793984 Opened 2 years ago Closed 1 year ago

eliot runs out of memory

Categories

(Eliot :: General, defect, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: willkg, Assigned: willkg)

References

Details

Attachments

(8 files)

Eliot occasionally hits out-of-memory errors when loading symcache files from disk.

https://sentry.io/organizations/mozilla/issues/3641895776/?project=6700123

MemoryError: null
  File "falcon/app.py", line 365, in falcon.app.App.__call__
    responder(req, resp, **params)
  File "markus/main.py", line 515, in _timer_decorator
    return fun(*args, **kwargs)
  File "eliot/symbolicate_resource.py", line 648, in on_post
    results.append(self.symbolicate(stacks, modules, debug_stats))
  File "eliot/symbolicate_resource.py", line 432, in symbolicate
    self.get_symcache(module_info, debug_stats)
  File "eliot/symbolicate_resource.py", line 328, in get_symcache
    data = self.cache.get(cache_key)
  File "eliot/cache.py", line 161, in get
    data = self.read_from_file(filepath)
  File "eliot/cache.py", line 103, in read_from_file
    data = fp.read()

That would explain the recent reports where people couldn't get symbols from the server! I thought it was conference WiFi issues.

There are only 18 instances of out-of-memory events, all between October 3rd and October 6th. Are there bugs for the reports you mentioned?

The two reports I am aware of were before Oct 3rd. The first was on Matrix here, on Sep 27. The second was on Sep 28/29 in person. So then I guess it was a different issue after all.
(What happened in these reports is that, once the request to the server fails, we fall back to local symbolication, which doesn't have file names + inlines. We don't display an error in this case because the local fallback worked.)

I found a second Sentry event:

https://sentry.io/organizations/mozilla/issues/3626762529/?query=is%3Aunresolved

It's essentially the same as the first, but it's been going on since September 28th. So that might have contributed to some of the problems you saw.

Also, there's another problem we're hitting which I wrote up in bug #1791785. gsvelto fixed that in dump_syms, but (as of now) there are 81 instances of that problem dating back to September 21st. I'll follow up on fixing those sym files by hand in that bug.

Assignee: nobody → willkg
Status: NEW → ASSIGNED

Prior to September 8th, Eliot memory usage is pretty static and hangs out at around 35%. Starting September 8th, Eliot eats memory until it hits around 80%, with spikes that hit 100%, which is when we see memory errors.

Nearby deploys:

  • 2022.09.01: bug #1788705, updated dependencies, update symbolic to support inline functions
  • 2022.09.15: bug #1791001, update dependencies

2022.09.01 is well before September 8th; however, I think what's going on is that once all the inline function support landed, we started seeing sym files with inline function data. Inline function data makes sym files larger (see bug #1790808), and I think that's causing something in Eliot to function poorly.

Next step is to see what Eliot is caching in memory and then see what we can do about it.

Depends on: 1796120

I can reproduce the issue on the stage server.

  1. generate a set of symbolication jobs from crash reports from Crash Stats using make-stacks.py
  2. call the symbolication API for stage using those stacks

After about 40 requests, the instance hits 100% memory and starts throwing MemoryError exceptions.
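For reference, here's a minimal sketch of step 2. The stage hostname, debug ID, and stack offsets are placeholders, and it assumes the standard symbolication v5 request shape:

    import requests

    # Placeholder stage URL; the real stage hostname isn't shown here.
    STAGE_URL = "https://symbolication.stage.example.com/symbolicate/v5"

    # One symbolication job: a module list plus stacks of
    # (module index, module-relative offset) pairs. Values are placeholders.
    job = {
        "memoryMap": [["xul.pdb", "44E4EC8C2F41492B9369D6B9A059577C2"]],
        "stacks": [[[0, 69724]]],
    }

    resp = requests.post(STAGE_URL, json={"jobs": [job]})
    resp.raise_for_status()
    print(resp.json())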

There's nothing in the logs except for the MemoryError exception. On Sentry, there's a MemoryError event like the one in this bug description.

I put together a dashboard with memory, disk usage, and requests graphs. Here's the period of time I was testing against stage:

https://earthangel-b40313e5.influxcloud.net/d/wtG190NVk/tecken-eliot-memory-usage-analysis-2022-10-27?orgId=1&from=1666918732868&to=1666923370851

A few things to note:

  1. 40 requests to hit a MemoryError sucks
  2. the memory usage never goes down afterwards--the instance never recovers
  3. disk usage never goes down, but that's expected--it's being used as an LRU cache

I added memory usage profiling to the code, ran it on my local machine, and symbolicated the same set of stacks. I don't see evidence of memory leaks in the Eliot webapp process, the disk cache manager process, or the Docker container. I also can't reproduce the issue locally.
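For reference, a minimal sketch of the kind of per-process memory instrumentation I mean, assuming psutil is available; the helper name and log format are illustrative, not the actual Eliot profiling code:

    import os

    import psutil

    def log_memory_usage(logger, label):
        # Illustrative helper, not the actual Eliot code: log the resident
        # set size (RSS) of the current process.
        rss_mb = psutil.Process(os.getpid()).memory_info().rss / 1024 / 1024
        logger.info("%s: rss=%.1f MB", label, rss_mb)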

The Grafana dashboard memory usage graph shows memory usage for the instance. We don't have more granularity than that at the moment.

I was wondering if we were using a ramdisk, but I don't think we are. The max cache size is set to 40GB and the instance only has 8GB of memory. Also, I don't see any evidence of us setting up a ramdisk in the infrastructure code.

I don't have a working theory as to what's going on. I'm pretty sure it's related to the change in sym file size, but I'm not sure how.

I want to rewrite how symbolication works. Right now it goes through the symbols in order and loads symcaches as it needs them. This means that a symbolication process is holding multiple symcaches in memory at the same time. I want to rewrite it to sort the symbols by module so it only has one symcache file in memory at any given time.
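A minimal sketch of that restructuring; get_symcache and lookup_frame are hypothetical stand-ins for the existing Eliot code:

    from collections import defaultdict

    def symbolicate_by_module(stacks, modules, get_symcache, lookup_frame):
        # Group frames by the module they belong to so each symcache is
        # loaded once and at most one is held in memory at a time.
        frames_by_module = defaultdict(list)
        for stack_index, stack in enumerate(stacks):
            for frame_index, (module_index, offset) in enumerate(stack):
                frames_by_module[module_index].append((stack_index, frame_index, offset))

        symbolicated = {}
        for module_index, frames in frames_by_module.items():
            symcache = get_symcache(modules[module_index])
            for stack_index, frame_index, offset in frames:
                symbolicated[(stack_index, frame_index)] = lookup_frame(symcache, offset)
            # Drop the reference before moving on to the next module.
            del symcache

        return symbolicated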

That's a good performance improvement to do regardless, but I would like to have a running theory about what's going on here.

Priority: P2 → P1

I spent too much time on this and can't figure out what the problem is, but I'm pretty sure it's in symbolic. I whittled everything down to a short script that I think exemplifies the problem and wrote up an issue.

https://github.com/getsentry/symbolic/issues/706

I need to switch gears, so I'm going to rewrite the symbolication code to reduce the number of modules it has in memory at once. Hopefully that helps. Then we'll set gunicorn max_requests and max_requests_jitter, and the combination of those three things should stabilize Eliot.
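A minimal sketch of what that gunicorn configuration looks like, assuming a gunicorn.conf.py-style config; the exact values are illustrative:

    # gunicorn.conf.py sketch -- values are illustrative, not the prod settings
    workers = 2

    # Recycle each worker after roughly this many requests so whatever memory
    # the process has accumulated gets returned to the OS.
    max_requests = 750

    # Random offset so the workers don't all restart at the same time.
    max_requests_jitter = 50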

mstange: If you have any thoughts on that symbolic issue and/or could chase that down, I'd be much obliged!

willkg merged PR #2638: "bug 1793984: drop max requests to 750" in 8ae1076.

We should change the values after things are more stable.

I sent a ton of requests to stage and the memory usage is more stable now with fewer workers and having them recycled regularly. I think this is a good enough temporary stopgap fix until we get a better fix in.

Duplicate of this bug: 1798109

All the changes so far were deployed just now in bug #1798788. The gunicorn configuration changes I made as a temporary stopgap will either eliminate the out-of-memory errors or at worst radically reduce them. I'll keep an eye on that.

Next step is to rewrite the symbolication code so it has at most one module in memory at any given time and also to stop saving sym files to disk before parsing them into symcache files. I'm working on that now.

Sebastian at Sentry said this:

We've been experimenting with this for a while and can't definitively figure out where the memory is going. Our best guess is that python doesn't immediately give the memory back when the garbage collector frees objects. Note that if you e.g. run the script on a directory containing several copies of the same file, reported memory usage will stay approximately constant.

https://github.com/getsentry/symbolic/issues/706#issuecomment-1307304152
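A minimal illustration of that behavior, assuming psutil is installed; the numbers vary by platform and allocator:

    import os

    import psutil

    def rss_mb():
        return psutil.Process(os.getpid()).memory_info().rss / 1024 / 1024

    print(f"baseline: {rss_mb():.1f} MB")
    data = [bytes(1024) for _ in range(500_000)]  # roughly 500 MB of objects
    print(f"after allocating: {rss_mb():.1f} MB")
    del data  # the objects are freed right away (refcount hits zero)...
    print(f"after del: {rss_mb():.1f} MB")  # ...but RSS may not drop all the way back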

Given that, I think we should make two changes in this bug:

  1. redo symbolication so it has at most one module in memory at a time
  2. run sym -> symcache conversion as a subprocess like we do with the stackwalker in Socorro (see the sketch after this list)
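A minimal sketch of item 2, assuming we run the conversion through a single-use worker process; convert_sym_to_symcache is a hypothetical stand-in for the real conversion code:

    from concurrent.futures import ProcessPoolExecutor

    def convert_sym_to_symcache(sym_path, symcache_path):
        # Hypothetical stand-in: parse the sym file at sym_path and write the
        # symcache to symcache_path (the real code would use symbolic here).
        raise NotImplementedError

    def convert_in_subprocess(sym_path, symcache_path):
        # Run the conversion in a short-lived child process so whatever memory
        # the parser allocates is returned to the OS when the child exits,
        # instead of staying resident in the web worker.
        with ProcessPoolExecutor(max_workers=1) as executor:
            return executor.submit(convert_sym_to_symcache, sym_path, symcache_path).result()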

Everything merged so far was deployed just now to prod in bug #1803625.

I'll keep this open until next week to see how that fares in production.

Eliot in stage and prod looks good now. Memory usage is well below 50% on average. Spikes don't cause the instance to run out of memory. Everything looks ok for the current load they're getting.

I reviewed instance sizing in bug #1803359 and concluded we shouldn't make instance changes.

I don't think we need to make architecture changes, either.

Given that, I think we're done here for now.

Status: ASSIGNED → RESOLVED
Closed: 1 year ago
Resolution: --- → FIXED

Moving to Eliot product.

Component: Symbolication → General
Product: Tecken → Eliot
