Closed Bug 1467573 Opened 6 years ago Closed 6 years ago

the slave-health reboot action fails for any machine

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: riman, Unassigned)

Details

I have tried to reboot a few machines from slave-health but the action does not work.

 Reboot is failed -> Output [Errno 2] No such file or directory
https://dxr.mozilla.org/build-central/source/slave_health/js/slave_health.js#528-529

I could not find a "/slaves" file under the path "/slave_health/json/".

https://dxr.mozilla.org/build-central/source/slave_health/js/slave_health.js#5
404 Not Found: https://secure.pub.build.mozilla.org/slaveapi/slaves/


Q: Could you take a look, please?
Flags: needinfo?(q)
Oh, I misunderstood your question on Slack! Q won't be able to help with this. I'm not sure who is best to poke about slaveapi issues at the moment; asking in #releng would certainly be a good step (unless jordan has a better idea).
Flags: needinfo?(q)
I did a small investigation, and it seems the "slaveapi_baseurl" doesn't work; it's used in "slave_shutdown_url" and "slave_reboot_url".

Has slaveapi been decommissioned or retired?
AFAIK slaveapi is around until ESR52 dies and takes buildbot with it in early September.
radu, ciduty: I believe slaveapi is busted due to the python upgrades work that was done throughout puppet. namely: https://github.com/mozilla-releng/build-puppet/pull/71/files#diff-95780eb30d106e421159cda544ff09ec

context:
<jlund> looks like slaveapi is broken. possibly python upgrade related
12:30:49 <aki> yeah, https://github.com/mozilla-releng/build-puppet/pull/71/files#diff-95780eb30d106e421159cda544ff09ec landed recently; we could back out the slaveapi portion

radu mentioned that a good workaround is to use inventory to find the host information and reboot the machine manually. That's a decent stopgap, but if reverting this patch "just works" I think it's worth doing.

relevant slaveapi logs from /builds/slaveapi/prod/slaveapi.log on slaveapi1.srv.releng.scl3.mozilla.com:

2018-06-22 09:56:17,199 - ERROR - -=- - Exception on /slaves/t-xp32-ix-005 [GET]
Traceback (most recent call last):
  File "/builds/slaveapi/prod/lib/python2.7/site-packages/flask/app.py", line 1817, in wsgi_app

  File "/builds/slaveapi/prod/lib/python2.7/site-packages/flask/app.py", line 1477, in full_dispatch_request
    return f
  File "/builds/slaveapi/prod/lib/python2.7/site-packages/flask/app.py", line 1381, in handle_user_exception
    self.jinja_env.filters[name or f.__name__] = f
  File "/builds/slaveapi/prod/lib/python2.7/site-packages/flask/app.py", line 1475, in full_dispatch_request
    """
  File "/builds/slaveapi/prod/lib/python2.7/site-packages/flask/app.py", line 1461, in dispatch_request
    the view, and further request handling is stopped.
  File "/builds/slaveapi/prod/lib/python2.7/site-packages/flask/views.py", line 84, in view
    constructor of the class.
  File "/builds/slaveapi/prod/lib/python2.7/site-packages/flask/views.py", line 149, in dispatch_request
    def dispatch_request(self, *args, **kwargs):
  File "/builds/slaveapi/prod/lib/python2.7/site-packages/slaveapi/web/slave.py", line 25, in get
    slave.load_all_info()
  File "/builds/slaveapi/prod/lib/python2.7/site-packages/slaveapi/slave.py", line 38, in load_all_info
    Machine.load_all_info(self)
  File "/builds/slaveapi/prod/lib/python2.7/site-packages/slaveapi/machines/base.py", line 33, in load_all_info
    self.load_inventory_info()
  File "/builds/slaveapi/prod/lib/python2.7/site-packages/slaveapi/slave.py", line 60, in load_inventory_info
    info = Machine.load_inventory_info(self)
  File "/builds/slaveapi/prod/lib/python2.7/site-packages/slaveapi/machines/base.py", line 43, in load_inventory_info
    info = inventory.get_system(self.fqdn)
  File "/builds/slaveapi/prod/lib/python2.7/site-packages/slaveapi/clients/inventory.py", line 99, in get_system
    result = requests.get(str(url), auth=auth).json()["objects"][0]
  File "/builds/slaveapi/prod/lib/python2.7/site-packages/requests/api.py", line 55, in get
    # avoid leaving sockets open which can trigger a ResourceWarning in some
  File "/builds/slaveapi/prod/lib/python2.7/site-packages/requests/api.py", line 44, in request
    :return: :class:`Response <Response>` object
  File "/builds/slaveapi/prod/lib/python2.7/site-packages/requests/sessions.py", line 361, in request
    #: representing multivalued query parameters.
  File "/builds/slaveapi/prod/lib/python2.7/site-packages/requests/sessions.py", line 464, in send
    :param timeout: (optional) How long to wait for the server to send
  File "/builds/slaveapi/prod/lib/python2.7/site-packages/requests/adapters.py", line 363, in send
    """
SSLError: [Errno 2] No such file or director

2018-06-22 12:10:18,984 - INFO - -=- - 10.22.81.90 - - [2018-06-22 12:10:18] "GET /slaves/health%20hg%20mozilla/actions/shutdown_buildslave HTTP/1.1" 200 135 0.000960

2018-06-22 12:10:18,994 - ERROR - -=- - Exception on /slaves/health hg mozilla [GET]
Traceback (most recent call last):
  File "/builds/slaveapi/prod/lib/python2.7/site-packages/flask/app.py", line 1817, in wsgi_app

  File "/builds/slaveapi/prod/lib/python2.7/site-packages/flask/app.py", line 1477, in full_dispatch_request
    return f
  File "/builds/slaveapi/prod/lib/python2.7/site-packages/flask/app.py", line 1381, in handle_user_exception
    self.jinja_env.filters[name or f.__name__] = f
  File "/builds/slaveapi/prod/lib/python2.7/site-packages/flask/app.py", line 1475, in full_dispatch_request
    """
  File "/builds/slaveapi/prod/lib/python2.7/site-packages/flask/app.py", line 1461, in dispatch_request
    the view, and further request handling is stopped.
  File "/builds/slaveapi/prod/lib/python2.7/site-packages/flask/views.py", line 84, in view
    constructor of the class.
  File "/builds/slaveapi/prod/lib/python2.7/site-packages/flask/views.py", line 149, in dispatch_request
    def dispatch_request(self, *args, **kwargs):
  File "/builds/slaveapi/prod/lib/python2.7/site-packages/slaveapi/web/slave.py", line 24, in get
    slave = SlaveClass(slave)
  File "/builds/slaveapi/prod/lib/python2.7/site-packages/slaveapi/slave.py", line 24, in __init__
    Machine.__init__(self, name)
  File "/builds/slaveapi/prod/lib/python2.7/site-packages/slaveapi/machines/base.py", line 18, in __init__
    answer = resolver.query(name)
  File "/builds/slaveapi/prod/lib/python2.7/site-packages/dns/resolver.py", line 981, in query
    except dns.query.UnexpectedSource as ex:
  File "/builds/slaveapi/prod/lib/python2.7/site-packages/dns/resolver.py", line 910, in query
    if len(qname) > 1:
NXDOMAIN
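
The second traceback comes from a request whose slave name was "health hg mozilla", which is not a hostname at all, so the DNS lookup ends in NXDOMAIN. As a rough sketch only (this is not slaveapi's code, and the function name `looks_like_hostname` is made up for illustration), a syntactic check like the following could reject such names before any DNS query is attempted:

```python
import re

# One DNS label: 1-63 characters from letters, digits and hyphens,
# with no hyphen at either end.
_LABEL = re.compile(r"(?!-)[A-Za-z0-9-]{1,63}(?<!-)")


def looks_like_hostname(name):
    """Return True if `name` is syntactically plausible as a DNS hostname.

    Hypothetical helper: it checks syntax only and performs no lookup.
    """
    if not name or len(name) > 253:
        return False
    labels = name.rstrip(".").split(".")
    return all(_LABEL.fullmatch(label) for label in labels)
```

With this check, "t-xp32-ix-005" passes and "health hg mozilla" is rejected because of the spaces. It would not help with the first traceback, which fails later, inside the HTTPS request to inventory.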


At a quick glance, both requests and dns are now unhappy. As aki mentions above, we could back out (revert) that patch, which may also require reinstalling the venv depending on how puppet handles the backout, or we could try to fix up the slaveapi source. The former would be a good stopgap, since slaveapi dies in early Sept, as fubar mentions in comment 4.
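
For anyone debugging the first traceback: an "SSLError: [Errno 2] No such file or directory" raised from requests often means that a certificate file handed to the SSL layer (typically the CA bundle) is missing on disk. That is an assumption about this host, not something the log confirms. A minimal sketch for checking candidate paths, with hypothetical helper names:

```python
import os


def candidate_ca_paths(environ=None):
    """Collect CA bundle paths from the environment variables requests honours.

    `environ` defaults to os.environ; passing a dict makes this easy to test.
    """
    environ = os.environ if environ is None else environ
    names = ("REQUESTS_CA_BUNDLE", "CURL_CA_BUNDLE")
    return [environ[name] for name in names if environ.get(name)]


def missing_files(paths):
    """Return the entries of `paths` that do not exist as regular files."""
    return [path for path in paths if not os.path.isfile(path)]
```

On the slaveapi host, the bundled CA path could be added to the list by running `requests.certs.where()` inside the venv; any path reported by `missing_files` would be a candidate cause.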

radu or someone from ciduty: could one of you put together a revert of https://github.com/mozilla-releng/build-puppet/pull/71 in puppet and open a PR for review? You will want to revert only the parts that affect slaveapi.
Flags: needinfo?(riman)
Flags: needinfo?(ciduty)
I will be looking at coming up with a fix instead of a backout.
Currently I'm setting up a local environment so I can test this issue properly.

I have also talked with :riman: if I can't find a fix for the issue during this shift, I will hand over everything I found to Radu so he can continue the work.
Flags: needinfo?(riman)
Flags: needinfo?(ciduty)
>I have created a revert patch for slaveapi/files/requirements.txt and made a PR for review.
https://github.com/raduiman/build-puppet/pull/1/commits/921be40b58e18772678461069891162d93242909
Could you take a look, please? I did not find other files that affect slaveapi.

>Should I reverse slaverebooter/files/requirements.txt too? Can it be related to the reboot issue?

>And finally, should I change test/verify-requirements.sh to avoid checking the reversed files?
Flags: needinfo?(jlund)
(In reply to Radu Iman[:riman] from comment #7)
> >I have created a revert patch for slaveapi/files/requirements.txt and made a PR for review.
> https://github.com/raduiman/build-puppet/pull/1/commits/921be40b58e18772678461069891162d93242909
> Could you take a look, please? I did not find other files that affect
> slaveapi.

Awesome! Can you make the PR against the mozilla-releng/build-puppet (upstream) repo? This PR is on your own account, so merging it wouldn't land on mozilla-releng/build-puppet.

> 
> >Should I reverse slaverebooter/files/requirements.txt too? Can it be related to the reboot issue?

I can't recall if slaverebooter predates and is unrelated to slaveapi. I would reverse that file too. callek?

> 
> >And finally, should I change test/verify-requirements.sh to avoid checking the reversed files?

yes probably, good catch!

callek, can you help out radu here. You are more familiar with slaveapi and you reviewed the original patch. Ben is still out on leave.
Flags: needinfo?(jlund) → needinfo?(bugspam.Callek)
(In reply to Jordan Lund (:jlund) from comment #8)
> (In reply to Radu Iman[:riman] from comment #7)
> > >I have created a revert patch for slaveapi/files/requirements.txt and made a PR for review.
> > https://github.com/raduiman/build-puppet/pull/1/commits/921be40b58e18772678461069891162d93242909
> > Could you take a look, please? I did not find other files that affect
> > slaveapi.
> 
> awesome! can you make a PR from the mozilla-releng/build-puppet (upstream)
> repo. This PR is on your own account so merging it wouldn't land on
> mozilla-releng/build-puppet.
> 

Ok, I do think the requirements revert would be a great first step. I wonder, though, whether we have any logs/data somewhere showing which specific versions were installed prior to that patch landing (in case an unnamed dep was bumped along with all this).

A reinstall of the venv will be the next step if a basic puppet revert isn't enough.
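
If a record of the old environment does turn up (for example a saved `pip freeze` listing, which is an assumption here), comparing it with the current one would show every pin that moved, including dependencies not named in requirements.txt. A small sketch with hypothetical function names:

```python
def parse_pins(text):
    """Map package name -> version for each `name==version` line in `text`."""
    pins = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if "==" not in line:
            continue  # skip blank lines, options and unpinned entries
        name, version = line.split("==", 1)
        pins[name.strip().lower()] = version.strip()
    return pins


def changed_pins(old_text, new_text):
    """Return {name: (old_version, new_version)} for every pin that differs.

    A version of None means the package is absent from that listing.
    """
    old, new = parse_pins(old_text), parse_pins(new_text)
    return {
        name: (old.get(name), new.get(name))
        for name in sorted(set(old) | set(new))
        if old.get(name) != new.get(name)
    }
```

The same function works on two revisions of slaveapi/files/requirements.txt, but that only covers the packages pinned there.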

> > 
> > >Should I reverse slaverebooter/files/requirements.txt too? Can it be related to the reboot issue?
> 
> I can't recall if slaverebooter predates and is unrelated to slaveapi. I
> would reverse that file too. callek?

Slaverebooter was roughly-speaking a cron that did the slave reboots for humans, and has been pretty defunct iirc. -- We shouldn't need to care about it.

> > 
> > >And finally, should I change test/verify-requirements.sh to avoid checking the reversed files?
> 
> yes probably, good catch!

++

> 
> callek, can you help out radu here. You are more familiar with slaveapi and
> you reviewed the original patch. Ben is still out on leave.

I've forgotten nearly all I once knew about slaveapi, but I'll certainly try to help where useful.
Flags: needinfo?(bugspam.Callek)
I have created a new pull request: https://github.com/mozilla-releng/build-puppet/pull/143
On top of the change proposed by Radu, I have added a blacklist entry for this SlaveAPI requirements.txt so that PyUp will no longer upgrade the deps.

PR available here: https://github.com/mozilla-releng/build-puppet/pull/145
Slave Health doesn't exist anymore, so I'm closing the bug.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard