Closed Bug 1560176 Opened 5 years ago Closed 1 year ago

switch symbols-urls to use tecken

Categories

(Socorro :: Processor, task, P2)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: willkg, Assigned: willkg)

References

(Blocks 1 open bug)

Details

We have a symbols-urls configuration parameter that has a list of urls to check in order for SYM files that minidump-stackwalk needs to symbolicate stacks. Currently, we check the public symbols bucket, then the private bucket, then hit Tecken last. Hitting Tecken last allows Tecken to record the missing symbol file so it can report on what's missing that we should upload.

We want to change that to private bucket and then Tecken. This bug covers that.

I talked with John and Brian about this. It would make it a little easier to migrate Tecken to GCP so that's kind of nice. We were wondering what kind of performance hit this would cause to minidump-stackwalk and whether it'd affect Tecken. So the first order of business would be to approximate that and if it's bad, then maybe not do this.

SYM file is in public bucket

I think this is the most likely scenario.

current:

  1. minidump-stackwalk checks public s3 bucket (hit)

proposed:

  1. minidump-stackwalk checks private s3 bucket (miss)
  2. minidump-stackwalk checks tecken
    1. tecken checks public s3 bucket (hit) and sends redirect
  3. minidump-stackwalk downloads from public s3

This ends up being more HTTP requests (1 vs. 4).

Socorro tends to process crashes from recent builds more often than other builds, so more of these are cached.

SYM file isn't in any bucket

I think this is the second most likely scenario.

current:

  1. minidump-stackwalk checks public s3 bucket (miss)
  2. minidump-stackwalk checks private s3 bucket (miss)
  3. minidump-stackwalk checks tecken
    1. tecken checks public s3 bucket (miss)

proposed:

  1. minidump-stackwalk checks private s3 bucket (miss)
  2. minidump-stackwalk checks tecken
    1. tecken checks public s3 bucket (miss)

This ends up being fewer HTTP requests (4 vs. 3). This is never cached, so we do this entire thing every time.

SYM file is in private bucket

I think this is unlikely--we don't have many symbol files in the private bucket.

current:

  1. minidump-stackwalk checks public s3 bucket (miss)
  2. minidump-stackwalk checks private s3 bucket (hit)

proposed:

  1. minidump-stackwalk checks private s3 bucket (hit)

This scenario is probably rare and getting increasingly rare since we don't have many symbols in the private bucket.

This is off the top of my head. minidump-stackwalk doesn't emit any signal about cache hits/misses, how long it takes to download SYM files, or where they came from. That "SYM is in public bucket" scenario is concerning, but maybe the cost of the extra requests that miss is dominated by the time spent downloading the SYM file, in which case it doesn't matter much? One way we could estimate this is to write a simulator that goes through json_dump output for a bunch of consecutively processed crashes and tells us what the differences might be.

That seems like a lot of work. Seems better to just switch stage, see how that goes, and then approximate it based on that.
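
For reference, the request counts in the three scenarios above can be reproduced with a tiny model. This is only a sketch, not Socorro or minidump-stackwalk code: the source labels ("public", "private", "tecken") and the function name are made up for illustration, and it assumes, as in the scenarios above, that a Tecken lookup costs one request to Tecken plus Tecken's own check of the public bucket, and that a hit costs one more request to follow the redirect.

```python
# Hypothetical labels for the places a SYM file can be looked up.
CURRENT_ORDER = ["public", "private", "tecken"]
PROPOSED_ORDER = ["private", "tecken"]


def count_requests(order, location):
    """Count HTTP requests needed to resolve one SYM file.

    order: sources checked in sequence.
    location: "public", "private", or "nowhere" (where the file really is).
    """
    requests = 0
    for source in order:
        if source == "tecken":
            # One request to Tecken, plus Tecken's check of the public bucket.
            requests += 2
            if location == "public":
                # Tecken redirects; the file is then downloaded from public S3.
                return requests + 1
        else:
            # Direct check of an S3 bucket.
            requests += 1
            if source == location:
                return requests
    return requests


for location in ("public", "nowhere", "private"):
    print(
        location,
        count_requests(CURRENT_ORDER, location),
        count_requests(PROPOSED_ORDER, location),
    )
```

It only counts requests. It says nothing about latency, caching, or SYM parsing time, so it's not a substitute for trying it on stage.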

Priority: -- → P2

Dropping this to a P3. We can think about it later when we're closer to Tecken moving to GCP or some other compelling reason comes up.

Priority: P2 → P3

All the moves are done so we're not waiting on that anymore.

We should do this and see how it affects processing times. If it makes processing times worse, then maybe we don't want to do it. Otherwise, I think we should since it makes it easier to move kinds of symbols around in Tecken without having to change it here, too.

Assignee: nobody → willkg
Status: NEW → ASSIGNED

Brian: Do you want to weigh in here? Is this ok to test out next week in stage and then prod?

Flags: needinfo?(bpitts)

Are we planning any changes to where Tecken stores symbols? If so, then I agree doing this makes sense, at least temporarily. Otherwise, I don't see any benefit from this change, but I do see risk from making Socorro's processor dependent on Tecken's availability.

Flags: needinfo?(bpitts)

I'm concerned that when we make changes to symbols locations, we (and all future maintainers) have to remember to update Socorro's configuration. I'd prefer not to have details like that littered across projects. I hear you on

Tecken has been pretty stable for a long time. I'm working on improving quality checks for Tecken so as to reduce the "stability as a fluke". Even so, I recognize that this gives additional impetus to keeping Tecken stable and up. Further, there's nothing in Socorro to indicate that Tecken was down when it was trying to process a crash and thus failed to symbolicate the stack. It'd be nice if it had something like that, but the symbolication code is in minidump-stackwalk and is complicated to work with.

Bug #1603278 is about how we're storing system symbols in with everything else which expires after 2 years, but system symbols should stick around longer. For example, Ubuntu LTS is supported for 5 years. Some people are using older versions of macOS and Windows and Android and other Linux distributions. I was thinking we probably need to move the system symbols to another path with a different expiration like we do with try symbols.

That's the only change I've got on the books. Having said that, I'm swamped, so I don't know when I'm going to get to it.

Mmm... I think I've argued myself into "it's fine now, let's push it off".

Assignee: willkg → nobody
Status: ASSIGNED → NEW

Comment #1 doesn't take into account the try location which we treat like a separate bucket.

Also, the task that dominates minidump-stackwalk time is parsing SYM files--not HTTP requests or stackwalking or symbolication.

I want to wait on this, but we should do it before we start GCP things.

I had another idea about this... What if we switched it to:

  1. hit symbols.mozilla.org
  2. hit private symbols bucket

Most symbols are not private symbols, so hitting symbols.mozilla.org is a single HTTP request and takes advantage of Tecken's symbol-exists cache and also marks a missing symbol. Further, when we start the GCP migration and have symbols in both GCP and AWS, we won't have to change Socorro.

The weird case is when the symbol we want is a private symbol. It'll get marked as a missing symbol as a result of not being available via symbols.mozilla.org. However, most symbols aren't private and anyone who doesn't have direct access to the private symbols bucket (which is everyone except Socorro) is going to have it marked as a missing symbol, too, so even though it's wrong, I don't think it messes up the bookkeeping in a meaningful way.
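
To make the bookkeeping side effect concrete, here's a rough model of that ordering. It's a sketch under assumptions: the labels "tecken" and "private" and the function name are invented for illustration, and it assumes symbols.mozilla.org only serves public symbols and records a miss for anything else it's asked about.

```python
def lookup(order, location):
    """Return (source_that_served_the_file, marked_missing_in_tecken).

    order: sources checked in sequence, e.g. ["tecken", "private"].
    location: "public", "private", or "nowhere".
    """
    marked_missing = False
    for source in order:
        if source == "tecken":
            if location == "public":
                return "tecken", marked_missing
            # Assumed behavior: Tecken records a miss for anything it can't serve.
            marked_missing = True
        elif source == location:
            return source, marked_missing
    return None, marked_missing


order = ["tecken", "private"]
for location in ("public", "private", "nowhere"):
    print(location, lookup(order, location))
```

The "private" row is the weird case: the file is found, but it has already been recorded as missing.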

The one issue here is that this will increase Tecken usage. I think that'll be fine. The downloads API is pretty fast and minimal.

I'm going to toss this in my queue of things to do in January 2022.

Assignee: nobody → willkg
Status: NEW → ASSIGNED
Priority: P3 → P2

I thought about this some more. I want to split out the "mark this as a missing symbol" to a separate endpoint. Then we can do this in Socorro:

SYMBOLS_URLS=https://symbols.mozilla.org/try,PRIVATEBUCKET,https://symbols.mozilla.org/api/missing/

That'll get all the bookkeeping right, be pretty fast (generally), and work as we migrate Tecken.

I'll write up a bug in Tecken for that.

Depends on: 1749407
Depends on: 1774004
No longer depends on: 1749407

We're going to nix the missing symbols bookkeeping altogether. That removes the complexity from this bug and will allow us to do:

SYMBOLS_URLS=https://symbols.mozilla.org/try,PRIVATEBUCKET
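
As an illustration of how a value like that reads as an ordered lookup list, here's a minimal sketch. The parsing function is hypothetical and isn't how Socorro necessarily handles the setting; PRIVATEBUCKET is the same placeholder as above, not a real URL.

```python
def parse_symbols_urls(value):
    """Split a comma-separated setting into an ordered list of sources."""
    return [item.strip() for item in value.split(",") if item.strip()]


# Placeholder value copied from the comment above.
urls = parse_symbols_urls("https://symbols.mozilla.org/try,PRIVATEBUCKET")
for position, url in enumerate(urls, start=1):
    print(position, url)
```

Order matters: the first entry is checked first, and later entries are only consulted on a miss.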

I created a PR in the infra repo.

The PR landed. We did a stage deploy.

I checked Grafana (Socorro and Tecken), the Crash Stats stage site, and logs and verified the following things:

  1. the logs show symbols_urls is set correctly for the MinidumpStackwalkerRule in stage
  2. crash reports are getting processed correctly with symbols
  3. there's no noticeable effect on Tecken for download API requests; Socorro stage does roughly 10% of the processing that prod does, but Tecken gets so many requests that it looks like a drop in the bucket

I think we're good!

This was pushed to prod a few hours ago in bug #1809927. I checked Tecken and I don't see a worrisome change in download API usage, so I think we're going to be fine. Marking as FIXED.

Status: ASSIGNED → RESOLVED
Closed: 1 year ago
Resolution: --- → FIXED