1101285 - Rebooting through slave health fails with "400 Client Error: Bad Request" if problem tracking bug can't be retrieved

Reporter

Description

•

10 years ago

Something must have changed, maybe in the bugzilla api, so that now when slave health tries to look up a problem tracking bug which doesn't exist, rather than just quickly putting up the icon for "file a new bug" like it used to, it now spins forever. That'd be fine, except apparently rebooting requires either the bug number or the certain knowledge that the bug does not exist to work properly, so for a slave which doesn't yet have a tracking bug, like https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?class=test&type=talos-linux64-ix&name=talos-linux64-ix-006, if you try to reboot it, you just get a quick "400 Client Error: Bad Request"

Phil Ringnalda (:philor)

Reporter

Comment 1

•

10 years ago

Oh, and I did verify that seemingly-wild assertion that rebooting requires a bug, by filing one for https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?class=test&type=t-xp32-ix&name=t-xp32-ix-024 and then successfully rebooting it once it was filed.

Justin Wood (:Callek)

Comment 2

•

10 years ago

Probably a regression from yesterdays bmo push:

https://wiki.mozilla.org/index.php?title=BMO/Recent_Changes&diff=prev&oldid=1034307

:glob ✱

Comment 3

•

10 years ago

odd -- bug 1089805 seems most related, but that was pushed into production a week ago (2014-11-10).

are you able to provide details of the request, the expected response, and the actual response?

Flags: needinfo?(bugspam.Callek)

Phil Ringnalda (:philor)

Reporter

Comment 4

•

10 years ago

http://mxr.mozilla.org/build/source/slave_health/js/bugzilla.js#132 and bug 1036802 - we request https://bugzilla.mozilla.org/rest/bug/thisaliasdoesnotexist and wait until it returns 200, and since it now can return a 400 Bad Request, we wait forever for that 200.

Flags: needinfo?(bugspam.Callek)

Justin Wood (:Callek)

Comment 5

•

10 years ago

FWIW this corresponds to:

2014-11-18 18:29:55,291 - INFO - -=- - Sending request: GET https://bugzilla.mozilla.org/rest/bug/t-xp32-ix-139
2014-11-18 18:29:55,578 - ERROR - -=- - Exception on /slaves/t-xp32-ix-139 [GET]
Traceback (most recent call last):
  File "/builds/slaveapi/prod/lib/python2.7/site-packages/flask/app.py", line 1817, in wsgi_app
    response = self.full_dispatch_request()
  File "/builds/slaveapi/prod/lib/python2.7/site-packages/flask/app.py", line 1477, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/builds/slaveapi/prod/lib/python2.7/site-packages/flask/app.py", line 1381, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/builds/slaveapi/prod/lib/python2.7/site-packages/flask/app.py", line 1475, in full_dispatch_request
    rv = self.dispatch_request()
  File "/builds/slaveapi/prod/lib/python2.7/site-packages/flask/app.py", line 1461, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/builds/slaveapi/prod/lib/python2.7/site-packages/flask/views.py", line 84, in view
    return self.dispatch_request(*args, **kwargs)
  File "/builds/slaveapi/prod/lib/python2.7/site-packages/flask/views.py", line 149, in dispatch_request
    return meth(*args, **kwargs)
  File "/builds/slaveapi/prod/lib/python2.7/site-packages/slaveapi/web/slave.py", line 21, in get
    slave.load_all_info()
  File "/builds/slaveapi/prod/lib/python2.7/site-packages/slaveapi/slave.py", line 42, in load_all_info
    self.load_bug_info()
  File "/builds/slaveapi/prod/lib/python2.7/site-packages/slaveapi/slave.py", line 82, in load_bug_info
    self.bug.refresh()
  File "/builds/slaveapi/prod/lib/python2.7/site-packages/slaveapi/clients/bugzilla.py", line 12, in refresh
    self.data = bugzilla_client.get_bug(self.id_)
  File "/builds/slaveapi/prod/lib/python2.7/site-packages/bzrest/client.py", line 58, in get_bug
    return self.request("GET", "bug/%s" % id_, data)["bugs"][0]
  File "/builds/slaveapi/prod/lib/python2.7/site-packages/bzrest/client.py", line 40, in request

Justin Wood (:Callek)

Comment 6

•

10 years ago

Err that was:

2014-11-18 18:29:55,578 - ERROR - -=- - Exception on /slaves/t-xp32-ix-139 [GET]
Traceback (most recent call last):
  File "/builds/slaveapi/prod/lib/python2.7/site-packages/flask/app.py", line 1817, in wsgi_app
    response = self.full_dispatch_request()
  File "/builds/slaveapi/prod/lib/python2.7/site-packages/flask/app.py", line 1477, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/builds/slaveapi/prod/lib/python2.7/site-packages/flask/app.py", line 1381, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/builds/slaveapi/prod/lib/python2.7/site-packages/flask/app.py", line 1475, in full_dispatch_request
    rv = self.dispatch_request()
  File "/builds/slaveapi/prod/lib/python2.7/site-packages/flask/app.py", line 1461, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/builds/slaveapi/prod/lib/python2.7/site-packages/flask/views.py", line 84, in view
    return self.dispatch_request(*args, **kwargs)
  File "/builds/slaveapi/prod/lib/python2.7/site-packages/flask/views.py", line 149, in dispatch_request
    return meth(*args, **kwargs)
  File "/builds/slaveapi/prod/lib/python2.7/site-packages/slaveapi/web/slave.py", line 21, in get
    slave.load_all_info()
  File "/builds/slaveapi/prod/lib/python2.7/site-packages/slaveapi/slave.py", line 42, in load_all_info
    self.load_bug_info()
  File "/builds/slaveapi/prod/lib/python2.7/site-packages/slaveapi/slave.py", line 82, in load_bug_info
    self.bug.refresh()
  File "/builds/slaveapi/prod/lib/python2.7/site-packages/slaveapi/clients/bugzilla.py", line 12, in refresh
    self.data = bugzilla_client.get_bug(self.id_)
  File "/builds/slaveapi/prod/lib/python2.7/site-packages/bzrest/client.py", line 58, in get_bug
    return self.request("GET", "bug/%s" % id_, data)["bugs"][0]
  File "/builds/slaveapi/prod/lib/python2.7/site-packages/bzrest/client.py", line 40, in request
    r.raise_for_status()
  File "/builds/slaveapi/prod/lib/python2.7/site-packages/requests/models.py", line 683, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
HTTPError: 400 Client Error: Bad Request
2014-11-18 18:29:55,579 - INFO - -=- - 10.22.81.88 - - [2014-11-18 18:29:55] "GET /slaves/t-xp32-ix-139 HTTP/1.1" 500 412 0.784816

Doesn't matter the slave name only that the bug doesn't exist.

Earliest hit (in the server side logs) is 2014-11-18 17:02:01 UTC

Pete Moore [:pmoore][:pete]

Comment 7

•

10 years ago

Philor reports this also affects slaverebooter. Bumping priority, and will look at this myself now.

Severity: normal → major

Pete Moore [:pmoore][:pete]

Comment 8

•

10 years ago

I thought it was related to bug 1095747 but timing wise, it looks like that only moved today, whereas this bug was raised yesterday.

Phil Ringnalda (:philor)

Reporter

Comment 9

•

10 years ago

That'd be another, as yet unfiled, bug, because it's either going to or already has hit an unreachable and tried to file a dcops bug and failed.

This one is the combination of http://mxr.mozilla.org/build/source/slave_health/js/bugzilla.js#132 and some similar spot in slaverebooter code which assume that all bugzilla api usage will return a 200 status code, when now that bug 1036802 is fixed a search for an alias which doesn't exist returns 400.

Chris Cooper [:coop] (he/him)

Assignee

Comment 10

•

10 years ago

(In reply to Phil Ringnalda (:philor) from comment #9)
> This one is the combination of
> http://mxr.mozilla.org/build/source/slave_health/js/bugzilla.js#132 and some
> similar spot in slaverebooter code which assume that all bugzilla api usage
> will return a 200 status code, when now that bug 1036802 is fixed a search
> for an alias which doesn't exist returns 400.

Fixed: https://hg.mozilla.org/build/slave_health/rev/795167a8fec8

Phil Ringnalda (:philor)

Reporter

Comment 11

•

10 years ago

Verified, thanks.

Looks like slaverebooter is probably wedged again, so I'm having trouble confirming what I'm pretty sure I saw yesterday, that it wasn't trying the ones without a bug, since now it isn't doing anything at all.

Phil Ringnalda (:philor)

Reporter

Comment 12

•

10 years ago

Or maybe https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?class=test&type=talos-linux64-ix&name=talos-linux64-ix-049 where it no longer spins, but the reboot still dies with "400 Client Error: Bad Request" is confirmation that slaverebooter/slaveapi still suffers from the same belief that it will always get a 200 and that a 400 means "pass along and fail" rather than "404 Alias Not Found".

Dylan Hardison [:dylan] (he/him)

Updated

•

10 years ago

Assignee: nobody → dylan

:glob ✱

Comment 13

•

10 years ago

we were incorrectly returning 200 responses even if an error occurred; the current behaviour is the correct one, and you'll have to update your scripts to accommodate more than just a 200 response.

:glob ✱

Comment 14

•

10 years ago

i'm sorry for the pain this caused; i did a bad job at communicating this change and didn't fully consider its impact on scripts.  i'll try to be more careful about these sorts of things in future.

Phil Ringnalda (:philor)

Reporter

Comment 15

•

10 years ago

copy-paste in http://mxr.mozilla.org/build/source/tools/lib/python/slavealloc/www/js/bugzilla.js#135 but surely nobody uses slavealloc directly anymore, do they?

Chris Cooper [:coop] (he/him)

Assignee

Comment 16

•

10 years ago

I dug into slaveapi a bit today, and am testing a patch in staging that wraps the bug requests in a try block for some recoverability. There still seems to be a few/many cases where slaveapi returns ISEs outside of this though. I might want to dig into those cases as a second pass. I'm not confident enough in the patch yet to deploy it, so I'll commit to checking in on slaverebooter over the weekend. 

slaverebooter itself is a mess of threading. I'm not sure whether the ISEs from slaveapi are what is exhausting the threadpool. I've lowered the overall timeout on slaverebooter from 30min to 2min. No sense waiting around longer than necessary if you're not going to get anything done.

Chris Cooper [:coop] (he/him)

Assignee

Updated

•

10 years ago

Assignee: dylan → coop

Status: NEW → ASSIGNED

Priority: -- → P2

Chris Cooper [:coop] (he/him)

Assignee

Comment 17

•

10 years ago

Attached patch Catch HTTPErrors in slaveapi — Details — Splinter Review

This is running in the staging env right now. Here's an example for a slave that has no problem tracking bug yet (talos-linux32-ix-010):

http://slaveapi-dev1.srv.releng.scl3.mozilla.com:8080/slaves/talos-linux32-ix-010

Attachment #8528409 - Flags: review?(pmoore)

Pete Moore [:pmoore][:pete]

Comment 18

•

10 years ago

Comment on attachment 8528409 [details] [diff] [review]
Catch HTTPErrors in slaveapi

Review of attachment 8528409 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks Coop!

I haven't set up a staging environment to test in, but since you have, that is good enough for me.

I see from the code it is essentially allowing an HTTP Error to occur, without falling over, which can only be a good thing, unless anything later depends on the data or the id being set. That said, if it isn't, I can't imagine much harm could be done, because we would not be able to publish an update to a bug without an id, so there should be no risk of damage. Worst case, it would crash, which is what it would do at the moment if we had an HTTP Error (without your fix) so worst case: no improvement; best case: not throwing an exception and surviving. So r+... I think!

Feel free to get an r? from someone else if you feel unsafe with my review! I have to clock off now, so I can also do a more thorough review in the morning and set up a staging environment, if you prefer. Just let me know!

Pete

Attachment #8528409 - Flags: review?(pmoore) → review+

Justin Wood (:Callek)

Comment 19

•

10 years ago

IMO this would be best fixed in bzrest, since it has code to raise the specific exceptions we already trap, but is now failing due to the break in bmo expectations.

Also, while you're here... theres 1073630 awaiting landing, which I can't land myself until Bug 1103089 is resolved.

Chris Cooper [:coop] (he/him)

Assignee

Comment 20

•

10 years ago

Comment on attachment 8528409 [details] [diff] [review]
Catch HTTPErrors in slaveapi

Review of attachment 8528409 [details] [diff] [review]:
-----------------------------------------------------------------

http://git.mozilla.org/?p=build/slaveapi.git;a=commit;h=11a580f6b019f992cf3d4fe69bc4e7d033682e27

Attachment #8528409 - Flags: checked-in+

Chris Cooper [:coop] (he/him)

Assignee

Comment 21

•

10 years ago

(In reply to Pete Moore [:pete][:pmoore] from comment #18) 
> I see from the code it is essentially allowing an HTTP Error to occur,
> without falling over, which can only be a good thing, unless anything later
> depends on the data or the id being set. That said, if it isn't, I can't
> imagine much harm could be done, because we would not be able to publish an
> update to a bug without an id, so there should be no risk of damage. Worst
> case, it would crash, which is what it would do at the moment if we had an
> HTTP Error (without your fix) so worst case: no improvement; best case: not
> throwing an exception and surviving. So r+... I think!

We end up with a null entry for the bug, which is the same behavior we had before when bugzilla was returning 200 codes.

Chris Cooper [:coop] (he/him)

Assignee

Comment 22

•

10 years ago

(In reply to Justin Wood (:Callek) from comment #19)
> IMO this would be best fixed in bzrest, since it has code to raise the
> specific exceptions we already trap, but is now failing due to the break in
> bmo expectations.

I don't disagree, but I'm not adding a pull request to bhearsum in the critical path when we have a working fix.

> Also, while you're here... theres 1073630 awaiting landing, which I can't
> land myself until Bug 1103089 is resolved.

Done.

Chris Cooper [:coop] (he/him)

Assignee

Comment 23

•

10 years ago

Attached patch Bump slaveapi version in puppet — Details — Splinter Review

Attachment #8528529 - Flags: review?(bugspam.Callek)

Chris Cooper [:coop] (he/him)

Assignee

Comment 24

•

10 years ago

(In reply to Chris Cooper [:coop] from comment #22)
> (In reply to Justin Wood (:Callek) from comment #19)
> > IMO this would be best fixed in bzrest, since it has code to raise the
> > specific exceptions we already trap, but is now failing due to the break in
> > bmo expectations.
> 
> I don't disagree, but I'm not adding a pull request to bhearsum in the
> critical path when we have a working fix.

Marking for follow-up:

https://github.com/bhearsum/bzrest/blob/master/bzrest/client.py#L47

Chris Cooper [:coop] (he/him)

Assignee

Comment 25

•

10 years ago

Comment on attachment 8528529 [details] [diff] [review]
Bump slaveapi version in puppet

Review of attachment 8528529 [details] [diff] [review]:
-----------------------------------------------------------------

r+ from Callek in IRC.

https://hg.mozilla.org/build/puppet/rev/94bac2d8b558

Attachment #8528529 - Flags: review?(bugspam.Callek)

Attachment #8528529 - Flags: review+

Attachment #8528529 - Flags: checked-in+

Chris Cooper [:coop] (he/him)

Assignee

Comment 26

•

10 years ago

This has been deployed now.

Chris Cooper [:coop] (he/him)

Assignee

Updated

•

10 years ago

Status: ASSIGNED → RESOLVED

Closed: 10 years ago

Resolution: --- → FIXED

Nobody; OK to take it and work on it

Updated

•

7 years ago

Component: Tools → General

Catch HTTPErrors in slaveapi 10 years ago Chris Cooper [:coop] (he/him) 1.18 KB, patch	pmoore : review+ coop : checked-in+	Details \| Diff \| Splinter Review
Bump slaveapi version in puppet 10 years ago Chris Cooper [:coop] (he/him) 724 bytes, patch	coop : review+ coop : checked-in+	Details \| Diff \| Splinter Review