Closed Bug 1100354 Opened 10 years ago Closed 7 years ago

Make a way to export our Super Search Fields list

Categories

(Socorro :: Backend, task, P3)

Tracking

(Not tracked)

VERIFIED FIXED

People

(Reporter: adrian, Assigned: willkg)

References

Details

Attachments

(5 files)

I'd be useful to be able to have an exact copy of our Super Search Fields list to use in local development. A script that would export that list as a JSON file, to be then used by ``scripts/setup_supersearch.py``, sounds like a good solution to me. 

My concern is: do we want to have that in our repo? We currently have a basic set of fields that new installs should use, see https://github.com/mozilla/socorro/blob/master/scripts/data/supersearch_fields.json -- do we want to update that with the latest from our prod database regularly, or keep it to a minimum that would make sense for most installs and let users deal with it themselves?
Maybe we should avoid checking it in altogether and create it as part of the build process instead.
rhelmer suggested that we back this data up in S3.
Can it be generated on the fly and downloaded by whoever wants to use it?
No, it's not available in the public API. Though now that you mention it, maybe it should be? It doesn't have any PII data, it just exposes the names of the PII fields. Is that a cause for concern?
It doesn't even have to be that fancy. It could just be a special URL in the django stuff that exposes it as raw JSON. The whole purpose of this was just for deployment kinda stuff, wasn't it?
So something like a link in the Super Search Fields admin panel to download a JSON blob of all the fields? That could do the trick to get the data locally, but doesn't change the problem of having an automatic backup. What about a simple cron job that pulls the data from ES and puts it into S3?
This is now available through the public API: https://crash-stats.mozilla.com/api/SuperSearchFields/

I have several options in my head:

1. make setup_supersearch.py pull that data from the public API
2. make an automatic export to S3
3. make a script to download that data locally and push it to our git repo

These solve different issues, not too sure yet what's the best list of things to do.
I like 1. That'll greatly help developers. 

However, what does it mean for stage? Isn't setup_supersearch.py for when you start from scratch?
It is. So what we're trying to do here is 1. make a backup of that list (in case ES fails somehow) and 2. make it easy for devs / external users to create an up-to-date list of fields in their database.
Assignee: adrian → nobody
I think we should float the idea of entirely do away with SuperSearch Fields in ES. No more ES indexes to store that. We export that list "today" and check it into git. That'd 100% take care of the backup problem. All code that relies on doing a ES read before it can do SuperSearch querying would be greatly simplified if all it had to do was to read a .json file from disk (like we do with socorro/schemas/crash_report.json). 
It would also entirely solve local dev. And we would be certain that stage always has the same config as prod (except that stage might sometimes be a week or so "ahead" of prod)

Adding or editing fields would become managed by GitHub PRs. 

Non-mozilla implementors would be struggling more. However, we can add it to settings like this:

 SUPERSEARCH_FIELDS_JSON_FILE = config('SUPERSEARCH_FIELDS_JSON_FILE', '{}/socorro/external/es/supersearchfields.json'.format(socorro_root_dir))

Thoughts?
I'm +1 on moving this to the repo, but I haven't looked at this issue long enough to understand the consequences.

Who or what is a "non-mozilla implementor"?
+1
(In reply to Will Kahn-Greene [:willkg] ET needinfo? me from comment #11)
> I'm +1 on moving this to the repo, but I haven't looked at this issue long
> enough to understand the consequences.
> 
> Who or what is a "non-mozilla implementor"?

Some other org/company that uses Socorro that we don't know about. 
In all fairness, we're committed to not cater to them and my solution would still allow flexibility. 

I looked a bit more at this. Basically, there are a bunch of places in the code where it does something like this: 
https://github.com/mozilla-services/socorro/blob/bcaf03e8cd00d912047f9ee1f727a80541c2e05a/webapp-django/crashstats/crashstats/views.py#L948 
that would just be replaced with the reading from a .json file. 


Adrian, 
Your "+1" comment helps. Are there any danger areas you can pre-warn us of? For example crontabber apps or curious hacks and exceptions. 
Otherwise I could get started on this soon.
The only thing that might be a little touchy is that the processors need that data as well, in order to create new indices every week. I am not sure how that is going to play out when we switch to Jansky. 

Also, that means you'll want to remove the Super Search Fields admin panel. You probably want to keep the SuperSearchFields class in socorro.external.es and make it a proxy to the new JSON file. That would make the transition quite simple I think, and would be a good first step. 

Note that this will be a trade-off. You will gain consistency across different environments, version control, and "easy" backups, but you will lose the ease of editing and visualizing the current list, and the fast iterative changes (I have used that a lot, made a change to the list, went to Super Search, tested, came back to the list and fixed something). 

As a side-note, and for the sake of writing it down somewhere, I used to be opposed to this change. The reason is that, as the maintainer of that list, I valued the fast iterative changes over the rest, as I found it to be a very important feature when you are dealing with data you do not know. Without that, you will have to be sure what the data in the field you're adding is, and how best to store it / query it. That is doable, but it won't be easy, and testing will have to be delayed. Overall, changing to this new, in-code list, will make the entire process of adding new fields or editing existing ones much slower.
Thanks for that Adrian!

Your point about quick iterations has another side to it too. If the supersearch fields are in a file, then I'd be able to, locally, make a change, start my runserver, test it and if I get it wrong, just edit the file against and try again. 
However, some changes to that list aren't something you can so easily try as it affects the storage so you'd have to run both processors AND webapp locally too.
> The only thing that might be a little touchy is that the processors need that data as well, in order to create new indices every week. I am not sure how that is going to play out when we switch to Jansky. 

Jansky won't be storing anything in ES. We're going to write a crashmover that will live in the same repo as the webapp to do that.
Assignee: nobody → peterbe
Adrian,

I took a stab at this. The first thing I did was to completely change that SuperSearchFields.get_fields() to read from a .json file instead. It was a snapshot based on prod. Next, I deleted a bunch of functionality related to the webapp's manage app. Easy. Things work. Webapp tests work. 

However, the socorro/externa/es unittests are how failing wildly, which brings me to ask you two simple questions:

1. Can you take a look? (I bet it's related to mocking supersearch fields wrong or how other es queries and setUp/tearDown relies on indexes being refreshed)

2. Is this too dangerous to work on now with with regards to https://github.com/mozilla-services/socorro/pull/3676 ?


My branch is here if you have a spare moment. I haven't started writing the README (that is going to be similar to the siglist and json schema README)
https://github.com/mozilla-services/socorro/compare/mozilla-services:master...peterbe:bug-1100354-read-super-search-fields-from-a-json-file
Flags: needinfo?(adrian)
In other words...
Peter, your work is conflicting with the latest ES 5.3 changes that got merged into master. Could you rebase your branch and see that it works with the new code? 

Also, after looking at your branch, I think your problems are indeed just mocking issues. I can help you with this when you're back.
Flags: needinfo?(adrian)
Assignee: peterbe → adrian
Status: NEW → ASSIGNED
Commit pushed to master at https://github.com/mozilla-services/socorro

https://github.com/mozilla-services/socorro/commit/673224a186794ff51453a614310555158cb87f55
Fixes bug 1100354 - Read Super Search Fields from a json file (#3844)

* fixes bug 1100354 - read super search fields from a .json file

* WIP trying to get this to work. Currently the index creation does not work because mock, for some reason, replaces the underlying SuperSearchFields object entirely instead of just mocking the get_fields method. There are plenty of prints around to help with debugging this.

* Fixed tests.

* Cleaning up.

* Fixed a few issues.

* Removing remnants.
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
This was reverted:

https://github.com/mozilla-services/socorro/commit/58269ed6487471a249b31c0cede4fd1ba46f319c

I'm going to reopen this and we can close it again when it lands.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
This was written in such a way that it requires the Elasticsearch 5.x code to land. Given that, I'm making this depend on that.
Depends on: 1342081
Commit pushed to master at https://github.com/mozilla-services/socorro

https://github.com/mozilla-services/socorro/commit/e0174c8448a78af8bd3fe515a580c09be90237fe
Reapply ES migration related commits (#3936)

* Revert "Revert "Fixes bug 1100354 - Read Super Search Fields from a json file (#3844)""

This reverts commit ad6ad9930104835e726566f187df3b15e95e2251.

* Revert "Revert "Bug 1100354 - Use proper way of loading a file. (#3845)""

This reverts commit a223c0ae613e08803a9948fc3ad456ca603be904.

* Revert "Revert "[DO NOT MERGE] fixes bug 1379662 - add TelemetryEnvironment to super_search_fields.json (#3848)""

This reverts commit 6ce31fc5b69ee3db42f4983eeff4c10810f30742.
Status: REOPENED → RESOLVED
Closed: 7 years ago7 years ago
Resolution: --- → FIXED
We recently deferred Elasticsearch 5 migration until after November, but in that discussion, we talked about how much we like this fix.

I'm reopening this to rewrite the code changes so that they work with Elasticsearch 1.x. That'll help with the local dev environment. I'm also removing the dependency on the changes for the Elasticsearch 5 migration.

Plus I'm going to grab it to do next week.
Assignee: adrian → willkg
Status: RESOLVED → REOPENED
No longer depends on: 1342081
Resolution: FIXED → ---
I pulled down the JSON file for super search fields from production and it's almost 4,000 lines and it's incredibly redundant.

I'm wondering whether we should be doing a JSON file in the repo at all or whether it'd be better to do a Python module where we can do classes and constants. That would reduce the size and complexity of the file drastically. It could also alleviate Adrian's "but you will lose the ease of editing and visualizing the current list" issue somewhat.

Another way to do this is to keep the JSON file, but write a cli to ease maintenance.

Anyhow, for now, I'm going to go with the plan that was implemented in the PR that targeted ES 5.3. We'll see how far that goes.
I wouldn't mind a any other format than JSON, because with JSON you can't have comments. 

I wouldn't mind a .py file that lists one class after another.
At the bottom could be a little function so you can get out pretty much the same thing as if you'd do `json.load(open(super_search_fields.json))`. 

As long as people feel empowered to make PRs on the file simply by doing an edit-and-PR in github.com's UI.
I'm going to switch it to be a Python file rather than a JSON file. Being able to do comments is important.

I'll defer refactoring it to a future bug.
super search fields for -stage as of 2017-08-28

$ wget https://crash-stats.allizom.org/api/SuperSearchFields/
$ cat < index.html | jq -S . > super_search_fields_stage.json
super search fields for -prod as of 2017-08-28

$ wget https://crash-stats.mozilla.com/api/SuperSearchFields/
$ cat < index.html | jq -S . > super_search_fields_prod.json
Commit pushed to master at https://github.com/mozilla-services/socorro

https://github.com/mozilla-services/socorro/commit/4144894f422e8e101629f0160afdc5234c5a5d48
1100354 supersearch (#3942)

* fixes bug 1100354 - move super search fields to JSON file

This redoes the work in PR #3844 and follow-up PR #3845 and moves super search
fields from being stored and manipulated in a database with associated Django
views to being stored in a JSON file.

* Move JSON file to Python file

* Fix long lines and quotes

* Convert true/false/null -> True/False/Null

* fixes bug 1393181 - Re-apply cache-removal commit

This reapplies d6ecf522. We shouldn't cache super search fields anymore.
Status: REOPENED → RESOLVED
Closed: 7 years ago7 years ago
Resolution: --- → FIXED
Test plan for landing this: https://github.com/mozilla-services/socorro/pull/3942#issuecomment-325441946

1. land the PR
2. wait for -stage to deploy
3. visit the following pages and make sure things look ok:
   
   3.1. https://crash-stats.allizom.org/home/product/Firefox
   3.2. https://crash-stats.allizom.org/topcrashers/?product=Firefox&version=57.0a1&days=7
   3.3. https://crash-stats.allizom.org/signature/?product=Firefox&version=57.0a1&signature=mozilla%3A%3Adom%3A%3ASelection%3A%3AGetPrimaryFrameForFocusNode&date=%3E%3D2017-08-21T18%3A36%3A26.000Z&date=%3C2017-08-28T18%3A36%3A26.000Z&_columns=date&_columns=product&_columns=version&_columns=build_id&_columns=platform&_columns=reason&_columns=address&_columns=install_time&_sort=-date&page=1#summary (and its tabs)
   3.4. https://crash-stats.allizom.org/crashes-per-day/?p=Firefox
   3.5. https://crash-stats.allizom.org/search/?product=Firefox&version=57.0a1&date=%3E%3D2017-08-21T18%3A37%3A00.000Z&date=%3C2017-08-28T18%3A37%3A00.000Z&_sort=-date&_facets=signature&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-signature
   3.6. https://crash-stats.allizom.org/report/index/20e25952-7271-44da-8e0a-816260170815#tab-metadata (expect to see AccessibilityClient and AccessibilityInProcClient in the Metadata tab even when NOT signed in with PII permissions)
   3.7. https://crash-stats.allizom.org/search/?accessibility_in_proc_client=0x400&product=Firefox&date=%3E%3D2017-08-08T17%3A04%3A00.000Z&date=%3C2017-08-30T17%3A04%3A00.000Z&_sort=-date&_facets=accessibility_in_proc_client&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-accessibility_in_proc_client (to be able to facet on accessibility_in_proc_client)

That tests Super Search pretty well.

Then we want to compare the old Super Search fields with the new one.

$ wget https://crash-stats.allizom.org/api/SuperSearchFields/
$ cat < index.html | jq -S . > super_search_fields_stage.json

Then compare the output of that with the attachment super_search_fields_stage.json in the bug. There should be some changes in the descriptions, but otherwise it should be the same.

That should cover it.
I went through the test plan and that was all fine.

I went through the super search fields and there are a *ton* of differences. I pulled the super search field data from -prod, so I think that it's all related to -stage not being updated.

1. -stage didn't have any field descriptions

2. there are fields that have values and -stage didn't have the same set of values (example: version -- more in a bit)

3. bunch of analyzers are different

4. -stage didn't have uptime_ts, total_frames, threads_index, thread_count, telemetry_environment, system_error, status, removed_fields, release_channel_test, os_ver, os, main_module, ipc_system_error, ipc_message_size, ipc_fatal_error_protocol, ipc_fatal_error_msg, ipc_extra_system_error, graphics_startup_test, e10s_cohort, dump_cpu_info, dump_cpu_arch, crashing_thread, crash_type, crash_address, cpu_microcode_version, content_sandbox_level, content_sandbox_enabled, content_sandbox_capable, content_sandbox_capabilities, async_plugin_shutdown, app_init_dlls, addons_should_have_blocked_e10s, and TextureUsage

5. -stage had ipc_message_name and graphicsanitytest


I think that's ok in regards to answering the question "do the changes in this bug work?" Also, I think it proves that this bug was a really good idea because -stage and -prod were way out of sync.


However, it opens up some interesting questions like:

1. ipc_message_name and graphicsanitytest were both on -stage, but not on -prod. Should they be in -prod?

2. version lists all the versions of Firefox. Do we need to be updating that every release? If so, is that captured somewhere?


Tagging Peter and Adrian for those last two questions.
Flags: needinfo?(peterbe)
Flags: needinfo?(adrian)
(In reply to Will Kahn-Greene [:willkg] ET needinfo? me from comment #35)
> I went through the test plan and that was all fine.
> 
> I went through the super search fields and there are a *ton* of differences.
> I pulled the super search field data from -prod, so I think that it's all
> related to -stage not being updated.
> 
> 1. -stage didn't have any field descriptions
> 
> 2. there are fields that have values and -stage didn't have the same set of
> values (example: version -- more in a bit)
> 
> 3. bunch of analyzers are different
> 
> 4. -stage didn't have uptime_ts, total_frames, threads_index, thread_count,
> telemetry_environment, system_error, status, removed_fields,
> release_channel_test, os_ver, os, main_module, ipc_system_error,
> ipc_message_size, ipc_fatal_error_protocol, ipc_fatal_error_msg,
> ipc_extra_system_error, graphics_startup_test, e10s_cohort, dump_cpu_info,
> dump_cpu_arch, crashing_thread, crash_type, crash_address,
> cpu_microcode_version, content_sandbox_level, content_sandbox_enabled,
> content_sandbox_capable, content_sandbox_capabilities,
> async_plugin_shutdown, app_init_dlls, addons_should_have_blocked_e10s, and
> TextureUsage
> 
> 5. -stage had ipc_message_name and graphicsanitytest
> 
> 
> I think that's ok in regards to answering the question "do the changes in
> this bug work?" Also, I think it proves that this bug was a really good idea
> because -stage and -prod were way out of sync.
> 
> 
> However, it opens up some interesting questions like:
> 
> 1. ipc_message_name and graphicsanitytest were both on -stage, but not on
> -prod. Should they be in -prod?
> 
No.
The supersearch fields in stage ES is to be discarded and ignored. The true list is that of prod. 
But when we made the ES --> super_search_fields.json did we do any extra cleanups Adrian?


> 2. version lists all the versions of Firefox. Do we need to be updating that
> every release? If so, is that captured somewhere?
> 

Perhaps it's my 5pm brain but which list do you mean? I don't understand the question.
Flags: needinfo?(peterbe)
(In reply to Peter Bengtsson [:peterbe] from comment #36)
> (In reply to Will Kahn-Greene [:willkg] ET needinfo? me from comment #35)
> > ...
> > 
> > However, it opens up some interesting questions like:
> > 
> > 1. ipc_message_name and graphicsanitytest were both on -stage, but not on
> > -prod. Should they be in -prod?
> > 
> No.
> The supersearch fields in stage ES is to be discarded and ignored. The true
> list is that of prod. 

Awesome!


> > 2. version lists all the versions of Firefox. Do we need to be updating that
> > every release? If so, is that captured somewhere?
> 
> Perhaps it's my 5pm brain but which list do you mean? I don't understand the
> question.

On -stage, it has form_field_choices for the "version" (lowercase "version") field:

https://crash-stats.allizom.org/api/SuperSearchFields/

On -prod, the form_field_choices for the "version" (lowercase "version") field is an empty list:

https://crash-stats.mozilla.com/api/SuperSearchFields/

I pulled the -prod data and put that in the PR and that landed and merged to master. So I would expect -stage to look like -prod now, but it doesn't.

So where are those versions coming from? Is that something we manually update? Is -stage hosed in some weird way?
(In reply to Will Kahn-Greene [:willkg] ET needinfo? me from comment #35)
> 1. ipc_message_name and graphicsanitytest were both on -stage, but not on
> -prod. Should they be in -prod?

+1 to Peter, prod is the One True List.

> 2. version lists all the versions of Firefox. Do we need to be updating that
> every release? If so, is that captured somewhere?

The `form_field_choices` list for `version` should be empty by default. Same as for `platform` and `product`, those are populated at runtime by the webapp, using data pulled from postgresql. That is done only when pulling data via the `search_fields` view [0], which uses the SearchForm class [1]. When pulling from the public API, you get what the file has without any changes, so you should see those `form_field_choices` empty. 

I have no idea why there are values in that field on stage. It might very well be because of how that data was put in there a very long time ago. In any case, it doesn't matter, and that field can be emptied without consequences. Once again, prod holds the Truth. 

[0] https://github.com/mozilla-services/socorro/blob/master/webapp-django/crashstats/supersearch/views.py#L269
[1] https://github.com/mozilla-services/socorro/blob/master/webapp-django/crashstats/supersearch/forms.py
Flags: needinfo?(adrian)
Commit pushed to master at https://github.com/mozilla-services/socorro

https://github.com/mozilla-services/socorro/commit/1d526937ec03cd690e1b48a6b57efd7b9d05a7d2
bug 1100354 - fix fields mutations (#3946)

Super search fields come from a Python module now and are handed around and some
of the users mutate the value. Because of that, we want to hand around a copy of
the structure--not the structure itself. This fixes that.

Additionally, this cleans up some comments that were wrong because we're not
storing the fields in JSON anymore.
I landed the follow-up commit, waited for -stage to deploy, went through the test plan, and everything looks ok now. Marking as VERIFIED.
Status: RESOLVED → VERIFIED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: