Closed
Bug 1101285
Opened 10 years ago
Closed 10 years ago
Rebooting through slave health fails with "400 Client Error: Bad Request" if problem tracking bug can't be retrieved
Categories
(Release Engineering :: General, defect, P2)
Release Engineering
General
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: philor, Assigned: coop)
Details
Attachments
(2 files)
1.18 KB,
patch
|
pmoore
:
review+
coop
:
checked-in+
|
Details | Diff | Splinter Review |
724 bytes,
patch
|
coop
:
review+
coop
:
checked-in+
|
Details | Diff | Splinter Review |
Something must have changed, maybe in the bugzilla api, so that now when slave health tries to look up a problem tracking bug which doesn't exist, rather than just quickly putting up the icon for "file a new bug" like it used to, it now spins forever. That'd be fine, except apparently rebooting requires either the bug number or the certain knowledge that the bug does not exist to work properly, so for a slave which doesn't yet have a tracking bug, like https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?class=test&type=talos-linux64-ix&name=talos-linux64-ix-006, if you try to reboot it, you just get a quick "400 Client Error: Bad Request"
Reporter | ||
Comment 1•10 years ago
|
||
Oh, and I did verify that seemingly-wild assertion that rebooting requires a bug, by filing one for https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?class=test&type=t-xp32-ix&name=t-xp32-ix-024 and then successfully rebooting it once it was filed.
Comment 2•10 years ago
|
||
Probably a regression from yesterdays bmo push: https://wiki.mozilla.org/index.php?title=BMO/Recent_Changes&diff=prev&oldid=1034307
odd -- bug 1089805 seems most related, but that was pushed into production a week ago (2014-11-10). are you able to provide details of the request, the expected response, and the actual response?
Flags: needinfo?(bugspam.Callek)
Reporter | ||
Comment 4•10 years ago
|
||
http://mxr.mozilla.org/build/source/slave_health/js/bugzilla.js#132 and bug 1036802 - we request https://bugzilla.mozilla.org/rest/bug/thisaliasdoesnotexist and wait until it returns 200, and since it now can return a 400 Bad Request, we wait forever for that 200.
Flags: needinfo?(bugspam.Callek)
Comment 5•10 years ago
|
||
FWIW this corresponds to: 2014-11-18 18:29:55,291 - INFO - -=- - Sending request: GET https://bugzilla.mozilla.org/rest/bug/t-xp32-ix-139 2014-11-18 18:29:55,578 - ERROR - -=- - Exception on /slaves/t-xp32-ix-139 [GET] Traceback (most recent call last): File "/builds/slaveapi/prod/lib/python2.7/site-packages/flask/app.py", line 1817, in wsgi_app response = self.full_dispatch_request() File "/builds/slaveapi/prod/lib/python2.7/site-packages/flask/app.py", line 1477, in full_dispatch_request rv = self.handle_user_exception(e) File "/builds/slaveapi/prod/lib/python2.7/site-packages/flask/app.py", line 1381, in handle_user_exception reraise(exc_type, exc_value, tb) File "/builds/slaveapi/prod/lib/python2.7/site-packages/flask/app.py", line 1475, in full_dispatch_request rv = self.dispatch_request() File "/builds/slaveapi/prod/lib/python2.7/site-packages/flask/app.py", line 1461, in dispatch_request return self.view_functions[rule.endpoint](**req.view_args) File "/builds/slaveapi/prod/lib/python2.7/site-packages/flask/views.py", line 84, in view return self.dispatch_request(*args, **kwargs) File "/builds/slaveapi/prod/lib/python2.7/site-packages/flask/views.py", line 149, in dispatch_request return meth(*args, **kwargs) File "/builds/slaveapi/prod/lib/python2.7/site-packages/slaveapi/web/slave.py", line 21, in get slave.load_all_info() File "/builds/slaveapi/prod/lib/python2.7/site-packages/slaveapi/slave.py", line 42, in load_all_info self.load_bug_info() File "/builds/slaveapi/prod/lib/python2.7/site-packages/slaveapi/slave.py", line 82, in load_bug_info self.bug.refresh() File "/builds/slaveapi/prod/lib/python2.7/site-packages/slaveapi/clients/bugzilla.py", line 12, in refresh self.data = bugzilla_client.get_bug(self.id_) File "/builds/slaveapi/prod/lib/python2.7/site-packages/bzrest/client.py", line 58, in get_bug return self.request("GET", "bug/%s" % id_, data)["bugs"][0] File "/builds/slaveapi/prod/lib/python2.7/site-packages/bzrest/client.py", line 40, in request
Comment 6•10 years ago
|
||
Err that was: 2014-11-18 18:29:55,578 - ERROR - -=- - Exception on /slaves/t-xp32-ix-139 [GET] Traceback (most recent call last): File "/builds/slaveapi/prod/lib/python2.7/site-packages/flask/app.py", line 1817, in wsgi_app response = self.full_dispatch_request() File "/builds/slaveapi/prod/lib/python2.7/site-packages/flask/app.py", line 1477, in full_dispatch_request rv = self.handle_user_exception(e) File "/builds/slaveapi/prod/lib/python2.7/site-packages/flask/app.py", line 1381, in handle_user_exception reraise(exc_type, exc_value, tb) File "/builds/slaveapi/prod/lib/python2.7/site-packages/flask/app.py", line 1475, in full_dispatch_request rv = self.dispatch_request() File "/builds/slaveapi/prod/lib/python2.7/site-packages/flask/app.py", line 1461, in dispatch_request return self.view_functions[rule.endpoint](**req.view_args) File "/builds/slaveapi/prod/lib/python2.7/site-packages/flask/views.py", line 84, in view return self.dispatch_request(*args, **kwargs) File "/builds/slaveapi/prod/lib/python2.7/site-packages/flask/views.py", line 149, in dispatch_request return meth(*args, **kwargs) File "/builds/slaveapi/prod/lib/python2.7/site-packages/slaveapi/web/slave.py", line 21, in get slave.load_all_info() File "/builds/slaveapi/prod/lib/python2.7/site-packages/slaveapi/slave.py", line 42, in load_all_info self.load_bug_info() File "/builds/slaveapi/prod/lib/python2.7/site-packages/slaveapi/slave.py", line 82, in load_bug_info self.bug.refresh() File "/builds/slaveapi/prod/lib/python2.7/site-packages/slaveapi/clients/bugzilla.py", line 12, in refresh self.data = bugzilla_client.get_bug(self.id_) File "/builds/slaveapi/prod/lib/python2.7/site-packages/bzrest/client.py", line 58, in get_bug return self.request("GET", "bug/%s" % id_, data)["bugs"][0] File "/builds/slaveapi/prod/lib/python2.7/site-packages/bzrest/client.py", line 40, in request r.raise_for_status() File "/builds/slaveapi/prod/lib/python2.7/site-packages/requests/models.py", line 683, in raise_for_status raise HTTPError(http_error_msg, response=self) HTTPError: 400 Client Error: Bad Request 2014-11-18 18:29:55,579 - INFO - -=- - 10.22.81.88 - - [2014-11-18 18:29:55] "GET /slaves/t-xp32-ix-139 HTTP/1.1" 500 412 0.784816 Doesn't matter the slave name only that the bug doesn't exist. Earliest hit (in the server side logs) is 2014-11-18 17:02:01 UTC
Comment 7•10 years ago
|
||
Philor reports this also affects slaverebooter. Bumping priority, and will look at this myself now.
Severity: normal → major
Comment 8•10 years ago
|
||
I thought it was related to bug 1095747 but timing wise, it looks like that only moved today, whereas this bug was raised yesterday.
Reporter | ||
Comment 9•10 years ago
|
||
That'd be another, as yet unfiled, bug, because it's either going to or already has hit an unreachable and tried to file a dcops bug and failed. This one is the combination of http://mxr.mozilla.org/build/source/slave_health/js/bugzilla.js#132 and some similar spot in slaverebooter code which assume that all bugzilla api usage will return a 200 status code, when now that bug 1036802 is fixed a search for an alias which doesn't exist returns 400.
Assignee | ||
Comment 10•10 years ago
|
||
(In reply to Phil Ringnalda (:philor) from comment #9) > This one is the combination of > http://mxr.mozilla.org/build/source/slave_health/js/bugzilla.js#132 and some > similar spot in slaverebooter code which assume that all bugzilla api usage > will return a 200 status code, when now that bug 1036802 is fixed a search > for an alias which doesn't exist returns 400. Fixed: https://hg.mozilla.org/build/slave_health/rev/795167a8fec8
Reporter | ||
Comment 11•10 years ago
|
||
Verified, thanks. Looks like slaverebooter is probably wedged again, so I'm having trouble confirming what I'm pretty sure I saw yesterday, that it wasn't trying the ones without a bug, since now it isn't doing anything at all.
Reporter | ||
Comment 12•10 years ago
|
||
Or maybe https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?class=test&type=talos-linux64-ix&name=talos-linux64-ix-049 where it no longer spins, but the reboot still dies with "400 Client Error: Bad Request" is confirmation that slaverebooter/slaveapi still suffers from the same belief that it will always get a 200 and that a 400 means "pass along and fail" rather than "404 Alias Not Found".
Updated•10 years ago
|
Assignee: nobody → dylan
Comment 13•10 years ago
|
||
we were incorrectly returning 200 responses even if an error occurred; the current behaviour is the correct one, and you'll have to update your scripts to accommodate more than just a 200 response.
Comment 14•10 years ago
|
||
i'm sorry for the pain this caused; i did a bad job at communicating this change and didn't fully consider its impact on scripts. i'll try to be more careful about these sorts of things in future.
Reporter | ||
Comment 15•10 years ago
|
||
copy-paste in http://mxr.mozilla.org/build/source/tools/lib/python/slavealloc/www/js/bugzilla.js#135 but surely nobody uses slavealloc directly anymore, do they?
Assignee | ||
Comment 16•10 years ago
|
||
I dug into slaveapi a bit today, and am testing a patch in staging that wraps the bug requests in a try block for some recoverability. There still seems to be a few/many cases where slaveapi returns ISEs outside of this though. I might want to dig into those cases as a second pass. I'm not confident enough in the patch yet to deploy it, so I'll commit to checking in on slaverebooter over the weekend. slaverebooter itself is a mess of threading. I'm not sure whether the ISEs from slaveapi are what is exhausting the threadpool. I've lowered the overall timeout on slaverebooter from 30min to 2min. No sense waiting around longer than necessary if you're not going to get anything done.
Assignee | ||
Updated•10 years ago
|
Assignee: dylan → coop
Status: NEW → ASSIGNED
Priority: -- → P2
Assignee | ||
Comment 17•10 years ago
|
||
This is running in the staging env right now. Here's an example for a slave that has no problem tracking bug yet (talos-linux32-ix-010): http://slaveapi-dev1.srv.releng.scl3.mozilla.com:8080/slaves/talos-linux32-ix-010
Attachment #8528409 -
Flags: review?(pmoore)
Comment 18•10 years ago
|
||
Comment on attachment 8528409 [details] [diff] [review] Catch HTTPErrors in slaveapi Review of attachment 8528409 [details] [diff] [review]: ----------------------------------------------------------------- Thanks Coop! I haven't set up a staging environment to test in, but since you have, that is good enough for me. I see from the code it is essentially allowing an HTTP Error to occur, without falling over, which can only be a good thing, unless anything later depends on the data or the id being set. That said, if it isn't, I can't imagine much harm could be done, because we would not be able to publish an update to a bug without an id, so there should be no risk of damage. Worst case, it would crash, which is what it would do at the moment if we had an HTTP Error (without your fix) so worst case: no improvement; best case: not throwing an exception and surviving. So r+... I think! Feel free to get an r? from someone else if you feel unsafe with my review! I have to clock off now, so I can also do a more thorough review in the morning and set up a staging environment, if you prefer. Just let me know! Pete
Attachment #8528409 -
Flags: review?(pmoore) → review+
Comment 19•10 years ago
|
||
IMO this would be best fixed in bzrest, since it has code to raise the specific exceptions we already trap, but is now failing due to the break in bmo expectations. Also, while you're here... theres 1073630 awaiting landing, which I can't land myself until Bug 1103089 is resolved.
Assignee | ||
Comment 20•10 years ago
|
||
Comment on attachment 8528409 [details] [diff] [review] Catch HTTPErrors in slaveapi Review of attachment 8528409 [details] [diff] [review]: ----------------------------------------------------------------- http://git.mozilla.org/?p=build/slaveapi.git;a=commit;h=11a580f6b019f992cf3d4fe69bc4e7d033682e27
Attachment #8528409 -
Flags: checked-in+
Assignee | ||
Comment 21•10 years ago
|
||
(In reply to Pete Moore [:pete][:pmoore] from comment #18) > I see from the code it is essentially allowing an HTTP Error to occur, > without falling over, which can only be a good thing, unless anything later > depends on the data or the id being set. That said, if it isn't, I can't > imagine much harm could be done, because we would not be able to publish an > update to a bug without an id, so there should be no risk of damage. Worst > case, it would crash, which is what it would do at the moment if we had an > HTTP Error (without your fix) so worst case: no improvement; best case: not > throwing an exception and surviving. So r+... I think! We end up with a null entry for the bug, which is the same behavior we had before when bugzilla was returning 200 codes.
Assignee | ||
Comment 22•10 years ago
|
||
(In reply to Justin Wood (:Callek) from comment #19) > IMO this would be best fixed in bzrest, since it has code to raise the > specific exceptions we already trap, but is now failing due to the break in > bmo expectations. I don't disagree, but I'm not adding a pull request to bhearsum in the critical path when we have a working fix. > Also, while you're here... theres 1073630 awaiting landing, which I can't > land myself until Bug 1103089 is resolved. Done.
Assignee | ||
Comment 23•10 years ago
|
||
Attachment #8528529 -
Flags: review?(bugspam.Callek)
Assignee | ||
Comment 24•10 years ago
|
||
(In reply to Chris Cooper [:coop] from comment #22) > (In reply to Justin Wood (:Callek) from comment #19) > > IMO this would be best fixed in bzrest, since it has code to raise the > > specific exceptions we already trap, but is now failing due to the break in > > bmo expectations. > > I don't disagree, but I'm not adding a pull request to bhearsum in the > critical path when we have a working fix. Marking for follow-up: https://github.com/bhearsum/bzrest/blob/master/bzrest/client.py#L47
Assignee | ||
Comment 25•10 years ago
|
||
Comment on attachment 8528529 [details] [diff] [review] Bump slaveapi version in puppet Review of attachment 8528529 [details] [diff] [review]: ----------------------------------------------------------------- r+ from Callek in IRC. https://hg.mozilla.org/build/puppet/rev/94bac2d8b558
Attachment #8528529 -
Flags: review?(bugspam.Callek)
Attachment #8528529 -
Flags: review+
Attachment #8528529 -
Flags: checked-in+
Assignee | ||
Comment 26•10 years ago
|
||
This has been deployed now.
Assignee | ||
Updated•10 years ago
|
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Updated•7 years ago
|
Component: Tools → General
You need to log in
before you can comment on or make changes to this bug.
Description
•