Closed Bug 1139545 Opened 9 years ago Closed 9 years ago

[heartbeat] IntegrityError issues

Categories

(Input Graveyard :: Backend, defect, P1)

defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: willkg, Assigned: willkg)

Details

(Whiteboard: u=gregg c=heartbeat p=3 s=input.2015q1)

After landing the changes in bug #1136809, I started getting error reports from prod like this:

IntegrityError: (1062, "Duplicate entry 'XXXXXXXXXX' for key 'heartbeat_answer_person_id_498c82c6ca3ccc36_uniq'")

I can't tell more because I can't see the POST data, but it looks like we're trying to save two records at the same time with the same (person, survey, flow) tuple.

This bug covers figuring out the issues and fixing it.
Bug #1136840 was supposed to give me more data on the problem by letting me see the POST data.

Gregg has a related issue: https://github.com/mozilla/self-repair-server/issues/84

I landed some changes for bug #1136840 which should give me better data for all API related things, but I'm wondering if I shouldn't additionally wrap the HB code in question and have it log to the journal to get better data.

Anyhow, hoping to get this analyzed and fixed by tomorrow.
Assignee: nobody → willkg
Status: NEW → ASSIGNED
Making this a P1.
Priority: -- → P1
I don't understand what's going on. I haven't seen a single IntegrityError in the last 8 hours.

Possibilities:

1. it was from some test suite and that stopped running? maybe it's a test suite that's CI-driven and gregg stopped landing code?

2. pushing out the changes in bug #1136840 unjammed some chaos monkey in the server causing the problem to go away heisenbugifically

3. some aspect of the fabric of the universe that defies the bounds of my imagination changed making the likelihood of the problem occurring drop like a whale plummeting through the atmosphere

I'll keep an eye on things. I'm really puzzled.
I'm skeptical that HB errors stopped entirely. My current theory is that it's a bug in the error handling code I pushed out yesterday. I added some defensive code around my error handling.

In a PR: https://github.com/mozilla/fjord/pull/510
Fix to handling landed in https://github.com/mozilla/fjord/commit/92ce3b6ea96ff6b9201c44430259f8fcc88d427d

Pushed it out just now. I'll see whether that changes the situation at all. If it doesn't, then it's likely the case that we stopped getting HB IntegrityErrors and I'll drop this in priority until it happens again and I have more data.
The WSGIHandler mixin stunk. I backed it out today and immediately started seeing HB-related IntegrityError issues. I have no idea what about it was so stinky.

Regardless, I wrapped the code in question with some error logging. I really want to see that post body.

In a PR: https://github.com/mozilla/fjord/pull/522

Landed in master: https://github.com/mozilla/fjord/commit/12940b0f01834af01828137f51fe40bc9c72eb78

I'll push this out as soon as I am able.
Pushed to prod just now. Going to wait for the next HB error and then see what's going on.
Yay! Got some data.

Observations:

1. It is a real HB post with real non-test data.

2. The HTTP POST that raised the error and the one that got through have an updated_ts that have a difference of 17. That field is milliseconds since epoch, so that's 17ms. I think this is almost certainly a race condition situation.

For now I'm going to wrap that critical section in a transaction making it atomic. I think that'll fix the issue here without creating new issues.
Did a real fix for this.

In a PR: https://github.com/mozilla/fjord/pull/523

Landed in master: https://github.com/mozilla/fjord/commit/8b61f14d649a0ee5700e4fdd9d1072a438d4264d

Pushed it to prod and watched it for an hour. I haven't seen an HB IntegrityError since plus no new types of errors.

This took several days to figure out, so I'm marking it as a 3 pointer. Closing for now, but if it turns out it's still an issue, I'll reopen.
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Whiteboard: u=gregg c=heartbeat p= s=input.2015q1 → u=gregg c=heartbeat p=3 s=input.2015q1
Product: Input → Input Graveyard
You need to log in before you can comment on or make changes to this bug.