Closed Bug 1101285 Opened 10 years ago Closed 10 years ago

Rebooting through slave health fails with "400 Client Error: Bad Request" if problem tracking bug can't be retrieved

Categories

(Release Engineering :: General, defect, P2)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: philor, Assigned: coop)

Details

Attachments

(2 files)

Something must have changed, maybe in the bugzilla api, so that now when slave health tries to look up a problem tracking bug which doesn't exist, rather than just quickly putting up the icon for "file a new bug" like it used to, it now spins forever. That'd be fine, except apparently rebooting requires either the bug number or the certain knowledge that the bug does not exist to work properly, so for a slave which doesn't yet have a tracking bug, like https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?class=test&type=talos-linux64-ix&name=talos-linux64-ix-006, if you try to reboot it, you just get a quick "400 Client Error: Bad Request"
Oh, and I did verify that seemingly-wild assertion that rebooting requires a bug, by filing one for https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?class=test&type=t-xp32-ix&name=t-xp32-ix-024 and then successfully rebooting it once it was filed.
odd -- bug 1089805 seems most related, but that was pushed into production a week ago (2014-11-10).

are you able to provide details of the request, the expected response, and the actual response?
Flags: needinfo?(bugspam.Callek)
http://mxr.mozilla.org/build/source/slave_health/js/bugzilla.js#132 and bug 1036802 - we request https://bugzilla.mozilla.org/rest/bug/thisaliasdoesnotexist and wait until it returns 200, and since it now can return a 400 Bad Request, we wait forever for that 200.
Flags: needinfo?(bugspam.Callek)
FWIW this corresponds to:

2014-11-18 18:29:55,291 - INFO - -=- - Sending request: GET https://bugzilla.mozilla.org/rest/bug/t-xp32-ix-139
2014-11-18 18:29:55,578 - ERROR - -=- - Exception on /slaves/t-xp32-ix-139 [GET]
Traceback (most recent call last):
  File "/builds/slaveapi/prod/lib/python2.7/site-packages/flask/app.py", line 1817, in wsgi_app
    response = self.full_dispatch_request()
  File "/builds/slaveapi/prod/lib/python2.7/site-packages/flask/app.py", line 1477, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/builds/slaveapi/prod/lib/python2.7/site-packages/flask/app.py", line 1381, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/builds/slaveapi/prod/lib/python2.7/site-packages/flask/app.py", line 1475, in full_dispatch_request
    rv = self.dispatch_request()
  File "/builds/slaveapi/prod/lib/python2.7/site-packages/flask/app.py", line 1461, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/builds/slaveapi/prod/lib/python2.7/site-packages/flask/views.py", line 84, in view
    return self.dispatch_request(*args, **kwargs)
  File "/builds/slaveapi/prod/lib/python2.7/site-packages/flask/views.py", line 149, in dispatch_request
    return meth(*args, **kwargs)
  File "/builds/slaveapi/prod/lib/python2.7/site-packages/slaveapi/web/slave.py", line 21, in get
    slave.load_all_info()
  File "/builds/slaveapi/prod/lib/python2.7/site-packages/slaveapi/slave.py", line 42, in load_all_info
    self.load_bug_info()
  File "/builds/slaveapi/prod/lib/python2.7/site-packages/slaveapi/slave.py", line 82, in load_bug_info
    self.bug.refresh()
  File "/builds/slaveapi/prod/lib/python2.7/site-packages/slaveapi/clients/bugzilla.py", line 12, in refresh
    self.data = bugzilla_client.get_bug(self.id_)
  File "/builds/slaveapi/prod/lib/python2.7/site-packages/bzrest/client.py", line 58, in get_bug
    return self.request("GET", "bug/%s" % id_, data)["bugs"][0]
  File "/builds/slaveapi/prod/lib/python2.7/site-packages/bzrest/client.py", line 40, in request
Err that was:

2014-11-18 18:29:55,578 - ERROR - -=- - Exception on /slaves/t-xp32-ix-139 [GET]
Traceback (most recent call last):
  File "/builds/slaveapi/prod/lib/python2.7/site-packages/flask/app.py", line 1817, in wsgi_app
    response = self.full_dispatch_request()
  File "/builds/slaveapi/prod/lib/python2.7/site-packages/flask/app.py", line 1477, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/builds/slaveapi/prod/lib/python2.7/site-packages/flask/app.py", line 1381, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/builds/slaveapi/prod/lib/python2.7/site-packages/flask/app.py", line 1475, in full_dispatch_request
    rv = self.dispatch_request()
  File "/builds/slaveapi/prod/lib/python2.7/site-packages/flask/app.py", line 1461, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/builds/slaveapi/prod/lib/python2.7/site-packages/flask/views.py", line 84, in view
    return self.dispatch_request(*args, **kwargs)
  File "/builds/slaveapi/prod/lib/python2.7/site-packages/flask/views.py", line 149, in dispatch_request
    return meth(*args, **kwargs)
  File "/builds/slaveapi/prod/lib/python2.7/site-packages/slaveapi/web/slave.py", line 21, in get
    slave.load_all_info()
  File "/builds/slaveapi/prod/lib/python2.7/site-packages/slaveapi/slave.py", line 42, in load_all_info
    self.load_bug_info()
  File "/builds/slaveapi/prod/lib/python2.7/site-packages/slaveapi/slave.py", line 82, in load_bug_info
    self.bug.refresh()
  File "/builds/slaveapi/prod/lib/python2.7/site-packages/slaveapi/clients/bugzilla.py", line 12, in refresh
    self.data = bugzilla_client.get_bug(self.id_)
  File "/builds/slaveapi/prod/lib/python2.7/site-packages/bzrest/client.py", line 58, in get_bug
    return self.request("GET", "bug/%s" % id_, data)["bugs"][0]
  File "/builds/slaveapi/prod/lib/python2.7/site-packages/bzrest/client.py", line 40, in request
    r.raise_for_status()
  File "/builds/slaveapi/prod/lib/python2.7/site-packages/requests/models.py", line 683, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
HTTPError: 400 Client Error: Bad Request
2014-11-18 18:29:55,579 - INFO - -=- - 10.22.81.88 - - [2014-11-18 18:29:55] "GET /slaves/t-xp32-ix-139 HTTP/1.1" 500 412 0.784816

Doesn't matter the slave name only that the bug doesn't exist.

Earliest hit (in the server side logs) is 2014-11-18 17:02:01 UTC
Philor reports this also affects slaverebooter. Bumping priority, and will look at this myself now.
Severity: normal → major
I thought it was related to bug 1095747 but timing wise, it looks like that only moved today, whereas this bug was raised yesterday.
That'd be another, as yet unfiled, bug, because it's either going to or already has hit an unreachable and tried to file a dcops bug and failed.

This one is the combination of http://mxr.mozilla.org/build/source/slave_health/js/bugzilla.js#132 and some similar spot in slaverebooter code which assume that all bugzilla api usage will return a 200 status code, when now that bug 1036802 is fixed a search for an alias which doesn't exist returns 400.
(In reply to Phil Ringnalda (:philor) from comment #9)
> This one is the combination of
> http://mxr.mozilla.org/build/source/slave_health/js/bugzilla.js#132 and some
> similar spot in slaverebooter code which assume that all bugzilla api usage
> will return a 200 status code, when now that bug 1036802 is fixed a search
> for an alias which doesn't exist returns 400.

Fixed: https://hg.mozilla.org/build/slave_health/rev/795167a8fec8
Verified, thanks.

Looks like slaverebooter is probably wedged again, so I'm having trouble confirming what I'm pretty sure I saw yesterday, that it wasn't trying the ones without a bug, since now it isn't doing anything at all.
Or maybe https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?class=test&type=talos-linux64-ix&name=talos-linux64-ix-049 where it no longer spins, but the reboot still dies with "400 Client Error: Bad Request" is confirmation that slaverebooter/slaveapi still suffers from the same belief that it will always get a 200 and that a 400 means "pass along and fail" rather than "404 Alias Not Found".
Assignee: nobody → dylan
we were incorrectly returning 200 responses even if an error occurred; the current behaviour is the correct one, and you'll have to update your scripts to accommodate more than just a 200 response.
i'm sorry for the pain this caused; i did a bad job at communicating this change and didn't fully consider its impact on scripts.  i'll try to be more careful about these sorts of things in future.
copy-paste in http://mxr.mozilla.org/build/source/tools/lib/python/slavealloc/www/js/bugzilla.js#135 but surely nobody uses slavealloc directly anymore, do they?
I dug into slaveapi a bit today, and am testing a patch in staging that wraps the bug requests in a try block for some recoverability. There still seems to be a few/many cases where slaveapi returns ISEs outside of this though. I might want to dig into those cases as a second pass. I'm not confident enough in the patch yet to deploy it, so I'll commit to checking in on slaverebooter over the weekend. 

slaverebooter itself is a mess of threading. I'm not sure whether the ISEs from slaveapi are what is exhausting the threadpool. I've lowered the overall timeout on slaverebooter from 30min to 2min. No sense waiting around longer than necessary if you're not going to get anything done.
Assignee: dylan → coop
Status: NEW → ASSIGNED
Priority: -- → P2
This is running in the staging env right now. Here's an example for a slave that has no problem tracking bug yet (talos-linux32-ix-010):

http://slaveapi-dev1.srv.releng.scl3.mozilla.com:8080/slaves/talos-linux32-ix-010
Attachment #8528409 - Flags: review?(pmoore)
Comment on attachment 8528409 [details] [diff] [review]
Catch HTTPErrors in slaveapi

Review of attachment 8528409 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks Coop!

I haven't set up a staging environment to test in, but since you have, that is good enough for me.

I see from the code it is essentially allowing an HTTP Error to occur, without falling over, which can only be a good thing, unless anything later depends on the data or the id being set. That said, if it isn't, I can't imagine much harm could be done, because we would not be able to publish an update to a bug without an id, so there should be no risk of damage. Worst case, it would crash, which is what it would do at the moment if we had an HTTP Error (without your fix) so worst case: no improvement; best case: not throwing an exception and surviving. So r+... I think!

Feel free to get an r? from someone else if you feel unsafe with my review! I have to clock off now, so I can also do a more thorough review in the morning and set up a staging environment, if you prefer. Just let me know!

Pete
Attachment #8528409 - Flags: review?(pmoore) → review+
IMO this would be best fixed in bzrest, since it has code to raise the specific exceptions we already trap, but is now failing due to the break in bmo expectations.

Also, while you're here... theres 1073630 awaiting landing, which I can't land myself until Bug 1103089 is resolved.
Comment on attachment 8528409 [details] [diff] [review]
Catch HTTPErrors in slaveapi

Review of attachment 8528409 [details] [diff] [review]:
-----------------------------------------------------------------

http://git.mozilla.org/?p=build/slaveapi.git;a=commit;h=11a580f6b019f992cf3d4fe69bc4e7d033682e27
Attachment #8528409 - Flags: checked-in+
(In reply to Pete Moore [:pete][:pmoore] from comment #18) 
> I see from the code it is essentially allowing an HTTP Error to occur,
> without falling over, which can only be a good thing, unless anything later
> depends on the data or the id being set. That said, if it isn't, I can't
> imagine much harm could be done, because we would not be able to publish an
> update to a bug without an id, so there should be no risk of damage. Worst
> case, it would crash, which is what it would do at the moment if we had an
> HTTP Error (without your fix) so worst case: no improvement; best case: not
> throwing an exception and surviving. So r+... I think!

We end up with a null entry for the bug, which is the same behavior we had before when bugzilla was returning 200 codes.
(In reply to Justin Wood (:Callek) from comment #19)
> IMO this would be best fixed in bzrest, since it has code to raise the
> specific exceptions we already trap, but is now failing due to the break in
> bmo expectations.

I don't disagree, but I'm not adding a pull request to bhearsum in the critical path when we have a working fix.

> Also, while you're here... theres 1073630 awaiting landing, which I can't
> land myself until Bug 1103089 is resolved.

Done.
Attachment #8528529 - Flags: review?(bugspam.Callek)
(In reply to Chris Cooper [:coop] from comment #22)
> (In reply to Justin Wood (:Callek) from comment #19)
> > IMO this would be best fixed in bzrest, since it has code to raise the
> > specific exceptions we already trap, but is now failing due to the break in
> > bmo expectations.
> 
> I don't disagree, but I'm not adding a pull request to bhearsum in the
> critical path when we have a working fix.

Marking for follow-up:

https://github.com/bhearsum/bzrest/blob/master/bzrest/client.py#L47
Comment on attachment 8528529 [details] [diff] [review]
Bump slaveapi version in puppet

Review of attachment 8528529 [details] [diff] [review]:
-----------------------------------------------------------------

r+ from Callek in IRC.

https://hg.mozilla.org/build/puppet/rev/94bac2d8b558
Attachment #8528529 - Flags: review?(bugspam.Callek)
Attachment #8528529 - Flags: review+
Attachment #8528529 - Flags: checked-in+
This has been deployed now.
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Component: Tools → General
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: