Closed Bug 1126879 Opened 9 years ago Closed 7 years ago

slaveapi fails to file tracking bugs when it wants to file an unreachable bug for a slave without a problem tracking bug

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: philor, Assigned: aobreja)

Details

Attachments

(3 files)

See (until the next time we toss history) https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?class=test&type=t-snow-r4&name=t-snow-r4-0011: two reboots with the result "400 Client Error: Bad Request" before I figured out that it wanted to call the slave unreachable but was failing to file the tracking bug to make it blocked by the unreachable bug; then, after I filed the tracker for it, the successful "Failed. Filed IT bug for reboot".
This is a regression in bmo, I'm sure. A SeaMonkey IRC bot (which I don't control) had issues yesterday as well.
> 400 Client Error: Bad Request

In order to look at this, we'll need to know what request the bot is making against bmo -- what webservice endpoint is it hitting, what method, and with what parameters?
Flags: needinfo?(bugspam.Callek)
For example:

2015-01-28 09:35:01,262 - INFO - panda-0524 - Sending request: POST https://bugzilla.mozilla.org/rest/bug
2015-01-28 09:35:01,879 - ERROR - panda-0524 - Something went wrong while processing!
2015-01-28 09:35:01,879 - ERROR - panda-0524 - Traceback (most recent call last):
2015-01-28 09:35:01,880 - ERROR - panda-0524 -   File "/builds/slaveapi/prod/lib/python2.7/site-packages/slaveapi/processor.py", line 64, in _worker
2015-01-28 09:35:01,880 - ERROR - panda-0524 -     res, msg = action(slave, *args, **kwargs)
2015-01-28 09:35:01,880 - ERROR - panda-0524 -   File "/builds/slaveapi/prod/lib/python2.7/site-packages/slaveapi/actions/reboot.py", line 116, in reboot
2015-01-28 09:35:01,880 - ERROR - panda-0524 -     slave.reboot_bug = file_reboot_bug(slave)
2015-01-28 09:35:01,880 - ERROR - panda-0524 -   File "/builds/slaveapi/prod/lib/python2.7/site-packages/slaveapi/clients/bugzilla.py", line 76, in file_reboot_bug
2015-01-28 09:35:01,880 - ERROR - panda-0524 -     resp = bugzilla_client.create_bug(data)
2015-01-28 09:35:01,881 - ERROR - panda-0524 -   File "/builds/slaveapi/prod/lib/python2.7/site-packages/bzrest/client.py", line 55, in create_bug
2015-01-28 09:35:01,881 - ERROR - panda-0524 -     return self.request("POST", "bug", data)
2015-01-28 09:35:01,881 - ERROR - panda-0524 -   File "/builds/slaveapi/prod/lib/python2.7/site-packages/bzrest/client.py", line 40, in request
2015-01-28 09:35:01,881 - ERROR - panda-0524 -     r.raise_for_status()
2015-01-28 09:35:01,881 - ERROR - panda-0524 -   File "/builds/slaveapi/prod/lib/python2.7/site-packages/requests/models.py", line 683, in raise_for_status
2015-01-28 09:35:01,881 - ERROR - panda-0524 -     raise HTTPError(http_error_msg, response=self)
2015-01-28 09:35:01,881 - ERROR - panda-0524 - HTTPError: 400 Client Error: Bad Request

Which is:
http://mxr.mozilla.org/build/source/slaveapi/slaveapi/clients/bugzilla.py#66

Which is calling into https://github.com/bhearsum/bzrest/blob/master/bzrest/client.py

Specifically, it's just calling a POST with that data: https://github.com/bhearsum/bzrest/blob/master/bzrest/client.py#L54
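
For reference, a minimal sketch of the equivalent raw request (the field values here are made up for illustration; the real payload is built by file_reboot_bug):

import json
import requests

# Illustrative payload only -- the real fields come from slaveapi's bugzilla client:
payload = {
    "product": "Release Engineering",
    "component": "Buildduty",
    "version": "other",
    "summary": "t-snow-r4-0011 problem tracking",
    "blocks": [123456],  # made-up bug number
}
resp = requests.post(
    "https://bugzilla.mozilla.org/rest/bug",
    data=json.dumps(payload),
    headers={"Content-Type": "application/json"},
)
resp.raise_for_status()  # on failure this raises the generic "400 Client Error" above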
Flags: needinfo?(bugspam.Callek) → needinfo?(glob)
Specifically, I suspect this is a regression from bug 1124437 ("Backport upstream bug 1090275 to bmo/4.2 to whitelist webservice api methods").
This doesn't appear to be related to bug 1124437 - I'm able to create bugs via REST without issue.

Bugzilla returns the reason for the failure in its JSON response:

{"documentation":"http://www.bugzilla.org/docs/tip/en/html/api/","code":32000,"error":true,"message":"The version value 'other' is not active."}

However, the library used here catches and handles the HTTP 400 result first, which means it drops that error message in favour of a generic "Bad Request" one.
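
For illustration, a minimal sketch of how a caller could surface that JSON error instead of the generic one (this is not bzrest's actual code; bug_data stands in for the real payload):

import requests

bug_data = {"summary": "placeholder payload"}  # stands in for the real bug fields
resp = requests.post("https://bugzilla.mozilla.org/rest/bug", json=bug_data)
if resp.status_code >= 400:
    try:
        err = resp.json()
    except ValueError:
        resp.raise_for_status()  # body wasn't JSON; fall back to the generic error
    else:
        # e.g. code 32000: "The version value 'other' is not active."
        raise RuntimeError("Bugzilla error %(code)s: %(message)s" % err)
resp.raise_for_status()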


My guess is that the bot is setting the "blocks" field to a bug which doesn't exist.
Flags: needinfo?(glob)
Should be fixed with bzrest 0.9, which I just updated on prod slaveapi.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
It would be sweet to finally get this fixed, since I've generally rebooted a slave two or three times, and thus lost 24-48 hours of its life, before I finally notice it's not actually getting anywhere.
It would be sweet to finally get this fixed, since we now have employees doing buildduty, including doing my non-job when I'm on non-PTO, and since I didn't train them, they don't know this bug exists.
Alin or Andrei should be able to tackle this.
Assignee: bugspam.Callek → nobody
Component: Tools → Buildduty
QA Contact: hwine → bugspam.Callek
Assignee: nobody → aobreja
This patch upgrades bzrest to version 0.9, which should solve the problem with the POST call ("return self.request("POST", "bug", data)").

A recent example of this problem can be found on t-yosemite-r7-0387.

The Puppet repository can be found here: https://github.com/mozilla/build-puppet

Callek, what would be the risks of doing this bzrest upgrade in puppet?
Comment on attachment 8784864 [details] [diff] [review]
bug1126879_puppet.patch

Review of attachment 8784864 [details] [diff] [review]:
-----------------------------------------------------------------

As I said in c#6 -- I updated bzrest on prod and thought that fixed it. I didn't realize we had bzrest==0.7 pinned here. So do eet.
Attachment #8784864 - Flags: review+
(In reply to Justin Wood (:Callek) from comment #12)
> Comment on attachment 8784864 [details] [diff] [review]
> bug1126879_puppet.patch
> 
> Review of attachment 8784864 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> As I said in c#6 -- I updated bzrest on prod and thought that fixed it. I
> didn't realize we had bzrest==0.7 pinned here. So do eet.

Ahh, and based on that comment, we also need to update: https://github.com/mozilla/build-slaveapi/blob/master/setup.py#L20

Otherwise we'll fail to install things right.

So steps:

* Update github's slaveapi repo (version bump for bzrest and slaveapi itself).
* Package it up and deploy to relengweb's pypi and puppet's pypi mirrors
* Deploy this puppet patch + a version bump for slaveapi.
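
As a sketch, the setup.py part of step one would look roughly like this (the version numbers are illustrative; only the bzrest pin and the slaveapi version actually change):

from setuptools import setup

setup(
    name="slaveapi",
    version="1.1.1",  # hypothetical bump so the new package actually gets deployed
    install_requires=[
        "bzrest==0.9",  # was pinned to 0.7, which swallowed Bugzilla's error messages
        # ...the other pinned dependencies stay as they are...
    ],
)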
Attachment #8786738 - Flags: review?(bugspam.Callek) → review+
Callek, I don't have merge rights for this patch; I get "Only those with write access to this repository can merge pull requests."
Can you merge this patch for me?

thanks
Flags: needinfo?(bugspam.Callek)
Done, thanks.
Flags: needinfo?(bugspam.Callek)
> * Package it up and deploy to relengweb's pypi and puppet's pypi mirrors
> * Deploy this puppet patch + a version bump for slaveapi.

Done this part.
If we expect this to work now, it doesn't.
Status: RESOLVED → REOPENED
Attached file logs_bug1126879.txt
It seems that upgrading bzrest from 0.7 to 0.9 did not solve the issue; it is the same issue as in Comment 3, with:
Sending request: POST https://bugzilla.mozilla.org/rest/bug
The problem can be seen on t-w864-ix-230 and t-w864-ix-199.

Callek, do you have any suggestions here?
Flags: needinfo?(bugspam.Callek)
Two thoughts:

* I thought there was a puppet issue with the version bumps; did that get sorted out? If not, we're not actually running the new code.
* Slaveapi needs to be manually restarted after the version bumps, since there is no soft reset and it retains state in memory (i.e., the history is not flushed to disk anywhere).
Flags: needinfo?(bugspam.Callek)
Callek, do you have any other suggestions on this bug? Andrei mentioned this morning that he was still stuck on it.
Flags: needinfo?(bugspam.Callek)
Nothing offhand; tracebacks would be useful if any exist in the logs.

Also worth trying: run similar commands from slaveapi's venv (slaveapi+bzrest) to validate that it can indeed reach bmo with the creds it has and is able to submit a bug in a similar fashion.
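
Something along these lines, run from that venv, would do as a smoke test (product, component, version, and the API key are placeholders, not what slaveapi actually sends):

import requests

payload = {
    "product": "Release Engineering",  # placeholder values throughout
    "component": "Buildduty",
    "version": "other",
    "summary": "slaveapi bmo smoke test -- please ignore",
}
resp = requests.post(
    "https://bugzilla.mozilla.org/rest/bug",
    json=payload,
    params={"api_key": "REPLACE_WITH_SLAVEAPI_KEY"},
)
print(resp.status_code)
print(resp.json())  # on a 400 this shows Bugzilla's real error message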

If we feel this is important enough and our buildduty team can't decipher the app, I can look into it, but it's a big context switch, so I'd like :coop to confirm that he does want me to dig in for debugging's sake.
Flags: needinfo?(bugspam.Callek)
Apparently it just needed dhouse to restart slaveapi (a couple of times) after he did a kernel upgrade on it, since it just filed some tracking bugs for the first time in just over two years.
Status: REOPENED → RESOLVED
Closed: 9 years ago → 7 years ago
Resolution: --- → FIXED
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard