Closed Bug 1026019 Opened 10 years ago Closed 9 years ago

make bigram generation better

Categories

(Input :: General, defect)

Importance: priority not set, severity normal

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: willkg, Unassigned, Mentored)

Details

One of the things we generate from an Input response is a list of bigrams of the text. This requires two steps:

1. tokenizing the text into a list of tokens
2. going through the tokens building tuples of tokens that are adjacent

That results in a list of bigrams.
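The two steps above can be sketched in a few lines of Python. This is a minimal illustration, not the actual compute_grams in fjord/feedback/utils.py: the token regexp and the stop-word set here are hypothetical stand-ins.

```python
import re

# Illustrative stop-word subset; Fjord's configured list is longer.
STOPWORDS = {"i", "a", "an", "the", "is", "it", "but"}

def tokenize(text):
    # Step 1: lowercase the text and split it into word tokens with a regexp.
    return re.findall(r"[a-z0-9']+", text.lower())

def bigrams(text):
    # Step 2: pair up adjacent tokens, dropping pairs that contain a stop word.
    tokens = tokenize(text)
    return [
        (first, second)
        for first, second in zip(tokens, tokens[1:])
        if first not in STOPWORDS and second not in STOPWORDS
    ]

print(bigrams("I use Firefox every day"))
# -> [('use', 'firefox'), ('firefox', 'every'), ('every', 'day')]
```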

Currently the tokenizing step is done with Elasticsearch analysis. This requires a round-trip to ES every time we generate the bigrams for a feedback response. This is pretty expensive, but it was done that way to speed up implementation.

This bug covers reworking that code so that it doesn't do the expensive Elasticsearch round-trip, and honing it further so that it produces useful bigram lists that take the following into account:

1. stop words -- We have a configured list of stop words, but we probably need to expand it.

2. stemming -- We're not doing stemming right now so words like "use" and "using" end up being two different words. We need to account for stemming, but we can't have bigrams like "us firefox". We need that to be "use firefox". I'm not sure what the options are here.

3. spelling fixes -- We'd have better bigram data if we could correct people's spelling. I don't know how much better the data would be, so we should figure out how to measure that first.
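To illustrate the stemming pitfall in item 2: a deliberately naive suffix-stripper (a toy written for this illustration, not Fjord code and not a real stemmer) reproduces exactly the "us firefox" problem described above.

```python
def naive_stem(word):
    # Toy stemmer: blindly strip a trailing "-ing" when enough of the
    # word remains. This turns "using" into "us", not "use", so a bigram
    # built from stems comes out as ("us", "firefox") instead of
    # ("use", "firefox") -- the exact failure mode the bug warns about.
    if word.endswith("ing") and len(word) > 4:
        return word[:-3]
    return word

print(naive_stem("using"))  # -> us  (the bad half of "us firefox")
print(naive_stem("use"))    # -> use (unchanged, so the two forms still differ)
```

Whatever stemming approach we pick needs to map "use" and "using" to the same human-readable form; crude suffix stripping clearly doesn't.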


This is a big-gish bug, so we'll split it into smaller bugs as people want to tackle portions of it.
Making this a mentorable bug. There's a lot of stuff in this bug, but feel free to break off a tiny aspect that interests you to work on.

The code in question is in fjord/feedback/models.py in Response.get_document. That calls compute_grams, which is in fjord/feedback/utils.py. That code tokenizes the text and then does some work on it.

If you're interested in working on this, let me know either in the bug comments here or on irc (I'm willkg).
Whiteboard: [mentor=willkg]
After writing this up, I decided the Elasticsearch (ab)use was too much. I changed it to a much simpler tokenizer that uses a regexp.

Everything else in this bug is still valid, though.

Outstanding issues:

1. making the tokenizing better: There are some very rough tests. It'd be good to flesh those out, find cases where the new tokenizing does poorly, and fix them. We may need to switch to a real tokenizer rather than the existing regexp split.

2. better stop words list: More analysis of the generated bigrams will likely unveil new stop words we should add to the list.

3. handle stemming: The new code still doesn't do any stemming.
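As a starting point for fleshing out the tokenizer tests, here's a sketch of a whitespace-split tokenizer with a couple of edge cases worth covering. The split strategy and punctuation handling are assumptions for illustration; the real tokenizer lives in fjord/feedback/utils.py and may differ.

```python
# Hypothetical tokenizer sketch: lowercase, split on whitespace,
# trim surrounding punctuation, and drop tokens that were all punctuation.
PUNCT = ".,!?\"'()"

def tokenize(text):
    return [tok.strip(PUNCT) for tok in text.lower().split() if tok.strip(PUNCT)]

# Edge cases a regexp-split tokenizer should be tested against:
print(tokenize("It doesn't work!"))    # contraction survives as one token
print(tokenize("Visit mozilla.org."))  # trailing "." trimmed, inner "." kept
```

Tests like these make it easy to spot where a simple regexp split breaks down (contractions, URLs, hyphenated words) and whether a real tokenizer is warranted.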
New tokenizer in pull request: https://github.com/mozilla/fjord/pull/308

Landed in master: https://github.com/mozilla/fjord/commit/dccfb76

Pushed to production just now.
Mentor: willkg
Whiteboard: [mentor=willkg]
I'm interested in tackling what remains of this bug. Is it still a mentorable bug?

Is this the best place to get started?

http://fjord.readthedocs.org/en/latest/hacking_howto.html
It is still marked as mentored. Comment #2 covers the outstanding issues.

Best place to get started is to set up a Fjord development environment. Over the weekend, I redid everything to use Vagrant, which theoretically makes getting started a lot easier. That code hasn't landed yet, but it probably will this week.

My vote is to wait until that lands; then things should be a lot easier. I'll let you know when it lands.

Flagging myself as needinfo so I don't forget.
Flags: needinfo?(willkg)
We've switched to Vagrant and building a development environment is a lot easier now.

Documentation for setting up the new build environment is here:

http://fjord.readthedocs.org/en/latest/getting_started.html

Are you still interested in working on this?
Flags: needinfo?(willkg)
We nixed bigram generation, so I'm closing this as WONTFIX.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → WONTFIX