Closed Bug 1753521 Opened 2 years ago Closed 2 years ago

stop indexing items from raw_crash

Categories

(Socorro :: Processor, task, P2)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: willkg, Assigned: willkg)

Attachments

(2 files)

The super_search_fields.py file specifies which data to index from the raw and processed crash. The raw crash contains the original crash annotation values. The processed crash contains normalized values, inferred values, and calculated values.

One of the problems we have on a regular-enough-to-frustrate-me basis is discovering we need to normalize a value that's being indexed that comes from the raw crash.

Because of the way the Elasticsearch crash storage indexes things, the items from the raw crash are put in a raw_crash namespace. So there's no way to start by indexing a value from the raw crash, discover you need to normalize it, and then index it from the processed crash instead. What a pain.

I want to switch to indexing only things from the processed crash. In order to not make users sad, we need to migrate from where we're at now to where we want to be. This bug covers that work.

I think what we want to do is something like:

  1. add processor rules to copy data from the raw crash to the processed crash for all items we're indexing from the raw crash
  2. add items to super_search_fields.py to index these fields in the processed_crash namespace, but don't expose them as search fields
  3. wait X months (at most 6, but maybe we can get away with 4?)
  4. remove the raw crash versions from super_search_fields.py and switch the search fields to search the processed crash versions
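Step 1 of that plan could be sketched roughly like this. This is a minimal illustration, not the actual Socorro processor rule API: the function name and the key list are assumptions made for the example, and a real rule would likely normalize values rather than copy them verbatim.

```python
# Hypothetical sketch of "copy from raw crash to processed crash".
# copy_from_raw_crash and KEYS_TO_COPY are illustrative names only.
KEYS_TO_COPY = ["collector_notes", "submitted_timestamp"]


def copy_from_raw_crash(raw_crash, processed_crash, keys=KEYS_TO_COPY):
    """Copy each listed key that exists in raw_crash into processed_crash."""
    for key in keys:
        if key in raw_crash:
            # This is the spot where normalization could later be added
            # without changing which namespace gets indexed.
            processed_crash[key] = raw_crash[key]
    return processed_crash


raw = {"collector_notes": ["note"], "submitted_timestamp": "2022-02-03T00:00:00"}
processed = copy_from_raw_crash(raw, {"signature": "OOM | small"})
```

Once a value lives in the processed crash, normalizing it later only touches the rule, not the indexed namespace.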

This idea needs testing. Also, the sooner we do this, the better.

Assignee: nobody → willkg
Status: NEW → ASSIGNED

I had a better plan: we enhance super search fields definitions to allow us to do migrations by specifying source, destination, and search keys. This would solve a long-standing problem where it's currently hard to seamlessly migrate data.
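To make the source/destination/search key idea concrete, here is a rough sketch of what a field definition might look like at two points in a migration. The class, attribute names, and dotted key paths are assumptions for illustration; the real definitions in super_search_fields.py may be shaped differently.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class FieldDef:
    """Hypothetical field definition; not the actual Socorro structure."""

    name: str
    source_key: str                   # where the value is read from
    destination_keys: Tuple[str, ...]  # where it is written in the indexed document
    search_key: str                   # which indexed key searches are run against


# During the migration: read from the processed crash, write to both the old
# and new locations, and keep searching the old location so that older
# indexes (which only have the old location) still match.
during = FieldDef(
    name="collector_notes",
    source_key="processed_crash.collector_notes",
    destination_keys=(
        "raw_crash.collector_notes",
        "processed_crash.collector_notes",
    ),
    search_key="raw_crash.collector_notes",
)

# After all old indexes have expired: write and search only the new location.
after = FieldDef(
    name="collector_notes",
    source_key="processed_crash.collector_notes",
    destination_keys=("processed_crash.collector_notes",),
    search_key="processed_crash.collector_notes",
)
```

The point of the design is that the search key is always one of the destination keys present in every live index, so the switch is invisible to people searching.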

While working on that new plan, I decided to rewrite indexing. Currently, the code:

  1. takes a raw crash and a processed crash
  2. copies them (iterate over entire raw and processed crash)
  3. removes fields that either aren't in the current schema or aren't in the schema for the index the document is going into (iterate over allowed keys and copy raw and processed crash)
  4. fixes the data values (iterate over super search fields multiple times: once each for strings, keywords, integers, longs, and booleans)
  5. builds a document to index

The new version builds the document as follows:

  1. traverse search fields (iterate over super search fields)
    1. if it's not an allowed key, skip to the next field (allowed keys is a set, so the check is O(1))
    2. extract data to index from source key
    3. fix data depending on data type
    4. populate destination keys in new document

This reduces the number of passes we make over the data, and the use of source, destination, and search keys allows us to migrate data from one place to another in the indexed document without affecting search.

willkg merged PR #6008: "bug 1753521: refactor indexing and start migrating some raw crash fields" in 118eba6.

This needs to hang out on stage until a new index is created, and we should make sure querying works across indexes with the old and new mappings. Having said that, I'm feeling pretty good about this.

There's some follow-up work that needs to happen, but it doesn't need to happen all at once. It can happen in stages.

Just landing this stage alone gives us the ability to migrate data around in the index over time, which is a huge win.

I tested searching on stage yesterday and everything looked fine as far as I could tell.

I deployed this just now in bug #1756845.

There are still a bunch more fields to migrate, so I'm leaving this open for now.

Bug #1755528 covered fixing flag/boolean fields.

After that, we have two more:

  • collector_notes
  • submitted_timestamp

I'll do those next.

Depends on: 1755528

willkg merged PR #6041: "bug 1753521: fix collector_notes" in e40d76a.

That covers everything. We still have the cleanup step from the migration, which we can do in August 2022 or thereabouts. Otherwise, we're done here.

See Also: → 1763264

Everything so far went to production in bug #1763234 just now. Follow-up work will be done in bug #1763264. Keeping this open to verify everything is working on Monday after we've created a new Elasticsearch index.

Flags: needinfo?(willkg)

I looked at some of the affected fields and they are getting copied to the processed crash and indexed from there. Marking as FIXED.

Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Flags: needinfo?(willkg)
Resolution: --- → FIXED