Closed Bug 1753521 Opened 2 years ago Closed 2 years ago

stop indexing items from raw_crash

Tracking

(Not tracked)

Status:

RESOLVED FIXED

People

(Reporter: willkg, Assigned: willkg)

References

Details

Attachments

(2 files)

pr 6008: bug 1753521: refactor indexing and start migrating some raw crash fields (2 years ago, Will Kahn-Greene [:willkg], 53 bytes, text/x-github-pull-request)
pr 6041: bug 1753521: fix collector_notes (2 years ago, Will Kahn-Greene [:willkg], 53 bytes, text/x-github-pull-request)

Will Kahn-Greene [:willkg] ET needinfo? me

Assignee

Description

•

2 years ago

The super_search_fields.py file specifies which data to index from the raw and processed crash. The raw crash contains the original crash annotation values. The processed crash contains normalized values, inferred values, and calculated values.

One of the problems we have on a regular-enough-to-frustrate-me basis is discovering we need to normalize a value that's being indexed that comes from the raw crash.

Because of the way the Elasticsearch crash storage thing indexes things, the items from the raw crash are put in a raw_crash namespace. So there's no way to start with indexing a value from the raw crash, discover you need to normalize it, and then index it from the processed crash. What a pain.
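To illustrate the namespace problem, here is a minimal sketch. The document layout, the `ExampleAnnotation` field, and the `lookup` helper are all made up for illustration; they are not taken from the Socorro code.

```python
# Hypothetical indexed document; field names and values are illustrative only.
INDEXED_DOCUMENT = {
    "raw_crash": {"ExampleAnnotation": " 1.0 "},
    "processed_crash": {"signature": "OOM | small"},
}


def lookup(document, namespace, name):
    """Return the value stored under namespace.name, or None if absent."""
    return document.get(namespace, {}).get(name)


# A search field bound to the raw_crash namespace only sees the original,
# un-normalized annotation value.
raw_value = lookup(INDEXED_DOCUMENT, "raw_crash", "ExampleAnnotation")

# Nothing was indexed under processed_crash for this field, so pointing the
# search field there would miss every document that was already indexed.
processed_value = lookup(INDEXED_DOCUMENT, "processed_crash", "ExampleAnnotation")
```

Under these assumptions, a field that starts out in `raw_crash` can't be switched to a normalized `processed_crash` version without losing the ability to search older documents.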

I want to switch to indexing only things from the processed crash. In order to not make users sad, we need to migrate from where we're at now to where we want to be. This bug covers that work.

Will Kahn-Greene [:willkg] ET needinfo? me

Assignee

Comment 1

•

2 years ago

I think what we want to do is something like:

1. add processor rules to copy data from the raw crash to the processed crash for all items we're indexing from the raw crash
2. add items to super_search_fields.py to index these fields in the processed_crash namespace, but don't expose them as search fields
3. wait X months (at most 6, but maybe we can get away with 4?)
4. remove the raw crash versions from super_search_fields.py and switch the search fields to search the processed crash versions
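The first step of that plan could look roughly like the sketch below. This is not the actual Socorro processor rule API; the function name, the `FIELDS_TO_COPY` mapping, and the field names are assumptions made up for illustration.

```python
# Hypothetical mapping of raw crash annotation names to processed crash keys.
FIELDS_TO_COPY = {
    "ExampleAnnotation": "example_annotation",
    "AnotherAnnotation": "another_annotation",
}


def copy_raw_to_processed(raw_crash, processed_crash, fields_to_copy):
    """Copy selected raw crash values into the processed crash.

    Values already present in the processed crash are left alone. That is an
    assumption of this sketch, so normalized values are never overwritten.
    """
    for raw_key, processed_key in fields_to_copy.items():
        if raw_key in raw_crash and processed_key not in processed_crash:
            processed_crash[processed_key] = raw_crash[raw_key]
    return processed_crash


example_raw = {"ExampleAnnotation": "1", "Unrelated": "x"}
example_processed = copy_raw_to_processed(example_raw, {}, FIELDS_TO_COPY)
```

Once the values live in the processed crash, they can be indexed from the processed_crash namespace alongside the existing raw crash versions.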

This idea needs testing. Also, the sooner we do this, the better.

Assignee: nobody → willkg

Status: NEW → ASSIGNED

Will Kahn-Greene [:willkg] ET needinfo? me

Assignee

Comment 2

•

2 years ago

I had a better plan: we enhance super search fields definitions to allow us to do migrations by specifying source, destination, and search keys. This would solve a long-standing problem where it's currently hard to seamlessly migrate data.
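As a rough sketch of what such a definition might look like: the key names (`source_key`, `destination_keys`, `search_key`), the field name, and the paths below are assumptions for illustration, not the actual definitions in super_search_fields.py.

```python
# Hypothetical field definition while a migration is in progress: the value
# is read from the processed crash, written to both the old and new
# locations, and still searched at the old location.
FIELD_DURING_MIGRATION = {
    "name": "example_field",
    "source_key": "processed_crash.example_field",
    "destination_keys": [
        "raw_crash.ExampleField",
        "processed_crash.example_field",
    ],
    "search_key": "raw_crash.ExampleField",
}

# After the old indexes have expired, the old location is dropped and search
# switches to the new one.
FIELD_AFTER_MIGRATION = {
    **FIELD_DURING_MIGRATION,
    "destination_keys": ["processed_crash.example_field"],
    "search_key": "processed_crash.example_field",
}


def is_migration_complete(field):
    """True once the field is stored only at the location it is searched at."""
    return field["destination_keys"] == [field["search_key"]]
```

The idea is that the search key only changes once every index still being queried contains the new destination, so users never see a gap.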

While working on that new plan, I decided to rewrite indexing. Currently, the code:

1. takes a raw crash and a processed crash
2. copies them (iterate over entire raw and processed crash)
3. removes fields that either aren't in the current schema or aren't in the schema for the index the document is going into (iterate over allowed keys and copy raw and processed crash)
4. fixes the data values (iterate over super search fields multiple times: once each for strings, keywords, integers, longs, and booleans)
5. builds a document to index

The new version builds the document by:

traverse the search fields (iterate over super search fields) and, for each field:
1. if it's not an allowed key, continue to the next field (allowed keys is a set, so this check is O(1))
2. extract data to index from source key
3. fix data depending on data type
4. populate destination keys in new document

This reduces the number of passes we make over the data, and the use of source, destination, and search keys lets us migrate data from one place to another in the indexed document without affecting search.
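The single-pass approach could be sketched like this. It is a simplified illustration and not the code from the PR: the helper names, the field definition keys, the dotted-path convention, and the type-fixing rules are all assumptions.

```python
def get_path(data, dotted_path):
    """Follow a dotted path through nested dicts; return None if missing."""
    current = data
    for part in dotted_path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def set_path(document, dotted_path, value):
    """Set a value at a dotted path, creating nested dicts as needed."""
    parts = dotted_path.split(".")
    current = document
    for part in parts[:-1]:
        current = current.setdefault(part, {})
    current[parts[-1]] = value


def fix_value(value, data_type):
    """Coerce a value to the field's type; return None if it can't be fixed."""
    if data_type == "integer":
        try:
            return int(value)
        except (TypeError, ValueError):
            return None
    if data_type == "boolean":
        if isinstance(value, bool):
            return value
        return str(value).lower() in ("1", "true", "t", "yes", "y")
    return str(value)


def build_document(fields, allowed_keys, raw_crash, processed_crash):
    """Build the document to index in one pass over the field definitions."""
    crash = {"raw_crash": raw_crash, "processed_crash": processed_crash}
    document = {}
    for field in fields:
        # Set membership check, so skipping a field is O(1).
        if field["name"] not in allowed_keys:
            continue
        value = get_path(crash, field["source_key"])
        if value is None:
            continue
        value = fix_value(value, field["type"])
        if value is None:
            continue
        for destination in field["destination_keys"]:
            set_path(document, destination, value)
    return document


# Hypothetical field definitions and crash data for illustration.
EXAMPLE_FIELDS = [
    {
        "name": "uptime",
        "type": "integer",
        "source_key": "processed_crash.uptime",
        "destination_keys": ["processed_crash.uptime"],
    },
    {
        "name": "not_in_this_index",
        "type": "string",
        "source_key": "raw_crash.Something",
        "destination_keys": ["raw_crash.Something"],
    },
]
example_document = build_document(
    EXAMPLE_FIELDS,
    allowed_keys={"uptime"},
    raw_crash={"Something": "value"},
    processed_crash={"uptime": "42"},
)
```

Because the document is built up from the field definitions rather than pruned down from copies of the crashes, nothing that isn't named in a field definition ends up in the index.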

Will Kahn-Greene [:willkg] ET needinfo? me

Assignee

Updated

•

2 years ago

Blocks: 1755095

Will Kahn-Greene [:willkg] ET needinfo? me

Assignee

Comment 3

•

2 years ago

Attached file pr 6008: bug 1753521: refactor indexing and start migrating some raw crash fields

Will Kahn-Greene [:willkg] ET needinfo? me

Assignee

Comment 4

•

2 years ago

•

Edited

willkg merged PR #6008: "bug 1753521: refactor indexing and start migrating some raw crash fields" in 118eba6.

This needs to hang out on stage until a new index is created, and we should make sure querying works across indexes with the old and new mappings. Having said that, I'm feeling pretty good about this.

There's some follow-up work that needs to happen, but it doesn't need to happen all at once--it can happen in stages.

Just landing this stage alone gives us the ability to migrate data around in the index over time which is a huge win.

Will Kahn-Greene [:willkg] ET needinfo? me

Assignee

Comment 5

•

2 years ago

I tested searching on stage yesterday and everything looked fine as far as I could tell.

I deployed this just now in bug #1756845.

There are still a bunch more fields to migrate, so I'm leaving this open for now.

Will Kahn-Greene [:willkg] ET needinfo? me

Assignee

Comment 6

•

2 years ago

Bug #1755528 covered fixing flag/boolean fields.

After that, we have two more:

collector_notes
submitted_timestamp

I'll do those next.

Will Kahn-Greene [:willkg] ET needinfo? me

Assignee

Updated

•

2 years ago

Depends on: 1755528

Will Kahn-Greene [:willkg] ET needinfo? me

Assignee

Comment 7

•

2 years ago

Attached file pr 6041: bug 1753521: fix collector_notes

Will Kahn-Greene [:willkg] ET needinfo? me

Assignee

Comment 8

•

2 years ago

willkg merged PR #6041: "bug 1753521: fix collector_notes" in e40d76a.

That covers everything. We still have the cleanup step from the migration, which we can do in August 2022 or thereabouts. Otherwise, we're done here.

Will Kahn-Greene [:willkg] ET needinfo? me

Assignee

Updated

•

2 years ago

Comment 9

•

2 years ago

Everything so far went to production in bug #1763234 just now. Followup work will be done in bug #1763264. Keeping this open to verify everything is working on Monday after we've created a new Elasticsearch index.

Flags: needinfo?(willkg)

Will Kahn-Greene [:willkg] ET needinfo? me

Assignee

Comment 10

•

2 years ago

I looked at some of the affected fields and they are getting copied to the processed crash and indexed from there. Marking as FIXED.

Status: ASSIGNED → RESOLVED

Closed: 2 years ago

Flags: needinfo?(willkg)

Resolution: --- → FIXED

Will Kahn-Greene [:willkg] ET needinfo? me

Assignee

Updated

•

2 years ago

Blocks: 1764395

Categories

(Socorro :: Processor, task, P2)
