Closed Bug 1367448 Opened 7 years ago Closed 5 years ago

support in existing tools to monitor state of hardware pools running taskcluster workers

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)


Tracking

(Not tracked)

RESOLVED INVALID

People

(Reporter: kmoir, Unassigned)

References

Details

Attachments

(4 files, 8 obsolete files)

This is similar to bug 1364955, but a more short-term solution.
This also includes the ability to reboot hardware when it is down via a dashboard.
I looked at the logs from our existing tools for the number of reboots today.

It looks like there are only one or two reboots of yosemite machines on an average day.  So if the work to implement tooling takes more than a week, perhaps we should consider going forward with the yosemite migration and implementing the monitoring, tooling, and reboot capabilities in time for the windows migration.

Right now, I'm trying to get the existing tools to run on dev-master2, which has flows that allow us to connect to the buildbot masters.  However, that machine is missing some libraries, so I'm working that out.
Assignee: nobody → kmoir
I have patches that I'm iterating on on my dev-master to show the pending counts. I have code that looks at the pending counts for levels 1-3 from this url and aggregates the total
https://queue.taskcluster.net/v1/pending/aws-provisioner-v1/gecko-3-b-macosx64
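The aggregation described above can be sketched roughly as follows. This is a hedged sketch, not the actual patch: the `pendingTasks` field name and response shape are assumptions about what the queue endpoint returns, and `fetch_total` hits the network.

```python
import json
import urllib.request

# Base URL from the comment above; the "pendingTasks" JSON field name
# is an assumption about the response shape.
BASE = "https://queue.taskcluster.net/v1/pending/aws-provisioner-v1"

def pending_url(level, platform="macosx64"):
    # e.g. level 3 -> .../gecko-3-b-macosx64
    return "{}/gecko-{}-b-{}".format(BASE, level, platform)

def sum_pending(bodies):
    # bodies: parsed JSON responses, one per level
    return sum(b.get("pendingTasks", 0) for b in bodies)

def fetch_total(levels=(1, 2, 3)):
    # fetch each level's pending count and aggregate the total
    bodies = []
    for level in levels:
        with urllib.request.urlopen(pending_url(level)) as resp:
            bodies.append(json.load(resp))
    return sum_pending(bodies)
```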

I was able to connect to the postgres db for greg's taskcluster analysis and run queries.

One question I have is whether pending counts for the tc versions of these platforms should be shown separately from the machines running buildbot, or whether they should all appear under the existing header.
(In reply to Kim Moir [:kmoir] from comment #3)
> I have patches that I'm iterating on on my dev-master to show the pending
> counts. I have code that looks at the pending counts for levels 1-3 from
> this url and aggregates the total
> https://queue.taskcluster.net/v1/pending/aws-provisioner-v1/gecko-3-b-
> macosx64

By design in BB, the machines that take level 3 and 2 build jobs are a different set than those that take level 1.  Different secrets, different trust in on-disk caches, etc.

Can we please confirm that the tooling will keep those separate, and that the pools used are also separate?
(In reply to Kim Moir [:kmoir] from comment #3)
> I have patches that I'm iterating on on my dev-master to show the pending
> counts. I have code that looks at the pending counts for levels 1-3 from
> this url and aggregates the total
> https://queue.taskcluster.net/v1/pending/aws-provisioner-v1/gecko-3-b-
> macosx64
>

Those are only for the cross-compiled builds in taskcluster provisioned within AWS.  I think the counts we should be looking at are those with the provisioner "scl3-puppet" and worker type "gecko-t-osx-1010" (it'll be that worker type once bug 1368718 lands).
Depends on: 1368718
Although, to hook it up, you could use the worker type os-x-10-10-gw for now as that was the old one that we're replacing.
Okay, I will fix that to address the correct worker type.
So I was trying to test my patches on dev-master2.  However, it only has flows to the buildbot db, not the slavealloc db, so I can't run the tool there to test.

So I thought that I would try testing on relengadm.  However, it is missing a bunch of OS packages like mysql-devel, so my virtualenv fails to install the required packages via pip. I don't have access to the infra puppet repo that manages this machine.

Now I'm testing on cruncher, where I get different error messages trying to invoke the virtualenv.

Callek or coop: how do you usually test changes for

https://secure.pub.build.mozilla.org/builddata/reports/slave_health/

Maybe I should just copy slave_health.py to a new file in the production env and redirect the output to a different dir to test my changes, not sure.
Flags: needinfo?(coop)
Flags: needinfo?(bugspam.Callek)
(In reply to Kim Moir [:kmoir] from comment #8)
> Maybe I should just copy slave_health.py to a new file in the production env
> and redirect the output to a different dir to test my changes, not sure.

This will work and has been done before, but I officially discourage messing around in production.

cruncher is not the intended place to test these changes any more, so I wouldn't bother futzing with the venv there. The correct spot for testing should be buildduty-tools (buildduty-tools.srv.releng.usw2.mozilla.com) these days.
Flags: needinfo?(coop)
Thanks coop.  I tried buildduty-tools.srv.releng.usw2.mozilla.com, but it seems to lack flows to the buildbot dbs; I get timeouts connecting to the mysql databases. I'll go back to cruncher and run a copy of the script manually in my own home dir.  It won't rsync files or anything, so it should be fine.
Attached patch bug1367448.patchSplinter Review
patch to add mac pending, next step is to add info for specific machines
Attached patch bug1367448-2.patch (obsolete) — Splinter Review
updated patches

Remaining work
patches to have postgresql-devel installed on the machine that runs the cron jobs for the dashboards
fix machine_health_for_type.py to output to a file
run machine_health_for_type.py before the main script runs or includes

machine_health_for_type.py was based off of garndt's work here https://gist.githubusercontent.com/gregarndt/83288a6b091719731a5934532063a0e2/raw/bb05e827c19aeb15f050a81abf443271c4e993eb/slave_health_for_type.py
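The "output to a file" item in the remaining-work list above could look something like this. This is a hypothetical sketch: the file name, data shape, and function names are placeholders, not what machine_health_for_type.py actually does.

```python
import json

def write_health_report(machines, path="machine_health.json"):
    # Dump per-machine health data somewhere the main slave_health
    # script can pick it up before it runs.
    with open(path, "w") as fh:
        json.dump({"machines": machines}, fh, indent=2)

def read_health_report(path="machine_health.json"):
    # Counterpart used by the main script to consume the report.
    with open(path) as fh:
        return json.load(fh)
```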
Comment on attachment 8873546 [details] [diff] [review]
bug1367448.patch

this is just for the pending tests; more work is needed to add pending builds, but those are not the immediate target for migration
Attachment #8873546 - Flags: feedback?(bugspam.Callek)
Attached patch bug1367448puppet.patch (obsolete) — Splinter Review
patch to add postgresql-devel to cruncher so it can query tc db
Attached patch bug1367448puppet-2.patch (obsolete) — Splinter Review
Attachment #8873823 - Attachment is obsolete: true
Attached patch bug1367448-4.patch (obsolete) — Splinter Review
merged code that queries tc database into the same file that queries the bb data
Attachment #8873604 - Attachment is obsolete: true
Attachment #8873546 - Attachment is obsolete: true
Attachment #8873546 - Flags: feedback?(bugspam.Callek)
Attachment #8873932 - Flags: feedback?(bugspam.Callek)
Attachment #8873933 - Flags: feedback?(bugspam.Callek)
Comment on attachment 8873933 [details] [diff] [review]
bug1367448-4.patch

Review of attachment 8873933 [details] [diff] [review]:
-----------------------------------------------------------------

At a glance this looks ok; I'm not really familiar with postgres or the python package used for it.  I am sad that we need to connect to the tc database directly and can't find this info via exposed TC APIs, though.
Attachment #8873933 - Flags: feedback?(bugspam.Callek) → feedback+
Attachment #8873932 - Flags: feedback?(bugspam.Callek) → feedback+
(In reply to Justin Wood (:Callek) from comment #17)
> Comment on attachment 8873933 [details] [diff] [review]
> bug1367448-4.patch
> 
> Review of attachment 8873933 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> At a glance this looks ok, I'm not really familiar with postgres or the
> python package used for it.  I am sad that we need to connect to the tc
> database directly and can't find out this info via exposed TC API's though.

We're both sad about not being able to use existing APIs for this.  Unfortunately, we do not have queue introspection or information about the work that workers have done exposed anywhere.  This database was stood up with a different purpose in mind and is also being used for this.  Ideally, at some point someone can build a better service for consuming information from taskcluster and putting it behind an API to answer some common questions.  Perhaps it's the app I'm using to feed this database, but I have not had a moment to add an API to it yet.
Attached patch bug1367448puppet-3.patch (obsolete) — Splinter Review
I tested this on a loaner, and it turns out the version of postgresql-libs we have pinned is too old for the latest version of postgresql-devel.
Attachment #8873933 - Attachment is obsolete: true
Attachment #8874427 - Flags: review?(bugspam.Callek)
Attachment #8874427 - Flags: review?(bugspam.Callek) → review+
So the problem with the postgresql-devel dependencies is that 

[root@cruncher-aws.srv.releng.usw2.mozilla.com ~]# yum list postgresql-libs
Installed Packages
postgresql-libs.x86_64                                                                     8.4.20-6.el6                                                                     @security_update_1319455

is installed

but the postgresql-devel-8.4.20-1.el6_5.x86_64 package that is on our mirrors depends on 8.4.20-6.el5 of postgresql-libs

And looking at the updates in puppet specified here modules/packages/manifests/security_updates.pp
  "postgresql-libs":
                      ensure => "8.4.20-6.el6";
  
the new version of postgresql-libs is specified, but I'm not sure where it comes from, as it is not listed on our mirrors

[kmoir@releng-puppet2.srv.releng.scl3.mozilla.com mirrors]$ find . -name postgresql-libs*
./centos/6.5/os/x86_64/Packages/postgresql-libs-8.4.18-1.el6_4.i686.rpm
./centos/6.5/os/x86_64/Packages/postgresql-libs-8.4.18-1.el6_4.x86_64.rpm
./centos/6.5/os/i386/Packages/postgresql-libs-8.4.18-1.el6_4.i686.rpm
./centos/6.5/updates/x86_64/Packages/postgresql-libs-8.4.20-1.el6_5.x86_64.rpm
./centos/6.5/updates/x86_64/Packages/postgresql-libs-8.4.20-1.el6_5.i686.rpm
./centos/6.5/updates/i386/Packages/postgresql-libs-8.4.20-1.el6_5.i686.rpm
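The manual `find` over the mirrors above could be done programmatically, for instance when auditing which pinned package versions are actually available. A small sketch (the mirror root is whatever path holds the repo checkout; nothing here is specific to releng-puppet):

```python
import fnmatch
import os

def find_rpms(mirror_root, pattern="postgresql-libs*.rpm"):
    # Walk the mirror tree and collect RPM file paths matching the
    # given glob, mirroring `find . -name postgresql-libs*`.
    hits = []
    for dirpath, _dirs, files in os.walk(mirror_root):
        for name in fnmatch.filter(files, pattern):
            hits.append(os.path.join(dirpath, name))
    return sorted(hits)
```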
So I found bug 1334172, where the version of postgresql-libs was upgraded by dhouse in relops.
Then I found this repo: http://vault.centos.org/6.8/os/x86_64/Packages/

which has the version of postgresql-libs that is specified in puppet (8.4.20-6.el6), as noted above.

So if I upgrade postgresql and postgresql-devel to these versions, I can install the psycopg2 libs via pip.

I'll update the puppet patch, not sure if this upgrade should apply to all machines or just cruncher.
Attachment #8874427 - Flags: checked-in+
Comment on attachment 8874427 [details] [diff] [review]
bug1367448-5.patch

reverted, since the virtualenv is not finding the correct libraries and thus the script is unable to load the postgres libraries

I'm investigating a fix
Attachment #8874427 - Flags: checked-in+ → checked-in-
Attached patch bug1367448puppetpath.patch (obsolete) — Splinter Review
It looks like it needs the postgresql91-devel-9.1.24-2PGDG.rhel6.x86_64 version of the libs in order to install psycopg2 via pip in the virtualenv and we need to add /usr/pgsql-9.1/bin to the path.
Attached patch bug1367448-6.patch (obsolete) — Splinter Review
So the new postgresql91 libs required the db connection to be in a different format, which this patch reflects.  However, I am unable to connect to the postgresql db from cruncher (where the machine health page is installed).  I assume it is a network restriction.  I was testing on other machines when I wrote the code because I was having such problems getting the correct libraries installed there, given the age of the os.  In any case, I see two options going forward:
1) Request that we be able to connect to the heroku db through mozilla security. Given that this server is ephemeral as I understand it and its hostname is subject to change, this is probably not a great way forward.
2) Create a local db to store the tc data that we are able to access from our machines that run tools.

Callek, Greg thoughts?
Flags: needinfo?(garndt)
Attached patch bug1367448-6.patch (obsolete) — Splinter Review
Attachment #8875002 - Attachment is obsolete: true
Standing up an individual instance of this should not be difficult.  If this app had an API that could be called instead of querying the DB directly, would that help?  Perhaps with a few hours of work I could do something like that.
Flags: needinfo?(garndt)
An api I could call instead of querying a db would be fantastic.
Comment on attachment 8873546 [details] [diff] [review]
bug1367448.patch

I landed this part so we would at least have some progress on this bug and be able to see the total pending counts.  Once we have the tc API for the state of the individual machines, I'll rework the other patches to include individual machine state.  I checked, and it is working fine and updating the status page.
Attachment #8873546 - Attachment is obsolete: false
Attachment #8873546 - Flags: checked-in+
Greg added an endpoint to his app that I can use i.e.
https://taskcluster-task-analysis.herokuapp.com/v1/worker-groups/scl3/workers/t-yosemite-r7-0040
and I'm testing patches to consume this
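Consuming that endpoint could look roughly like this. A hedged sketch: the base URL and path shape come from the example URL above, but the response body structure is an assumption, and `fetch_worker` hits the network.

```python
import json
import urllib.request

# Base URL taken from the endpoint example in the comment above.
APP = "https://taskcluster-task-analysis.herokuapp.com/v1"

def worker_url(worker_group, worker_id):
    # Build the per-worker URL, e.g. .../worker-groups/scl3/workers/t-yosemite-r7-0040
    return "{}/worker-groups/{}/workers/{}".format(APP, worker_group, worker_id)

def fetch_worker(worker_group, worker_id):
    # Fetch and parse the worker record (assumed to be JSON).
    with urllib.request.urlopen(worker_url(worker_group, worker_id)) as resp:
        return json.load(resp)
```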
New patches to test garndt's new endpoint 

https://taskcluster-task-analysis.herokuapp.com/v1/worker-groups/scl3/workers/t-yosemite-r7-446/tasks 

I tested this in a separate directory on cruncher and the testing went well
Attachment #8875003 - Attachment is obsolete: true
Attachment #8875860 - Flags: review?(bugspam.Callek)
Comment on attachment 8875860 [details] [diff] [review]
bug1367448-7.patch

Review of attachment 8875860 [details] [diff] [review]:
-----------------------------------------------------------------

::: scripts/slave_health.py
@@ +103,5 @@
> +
> +
> +  #  # this should be a list that TC could be using that is not currently allocated
> +  #  # to buildbot
> +    machines = ["t-yosemite-r7-{:04d}".format(i) for i in range(40,50)]

I'm not fond of hardcoding the list here; at the least, make it a global array at the top of the file that is then referenced here.
Attachment #8875860 - Flags: review?(bugspam.Callek) → review+
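The refactor suggested in the review might look like this sketch: hoist the hardcoded range into a module-level constant. The range 40-49 comes from the quoted patch; the constant and function names are hypothetical.

```python
# Module-level list of TC-allocated yosemite machines, defined once at
# the top of the file instead of inline where it is used.
TC_YOSEMITE_MACHINES = ["t-yosemite-r7-{:04d}".format(i) for i in range(40, 50)]

def tc_machine_names():
    # Return a copy so callers can't mutate the shared list.
    return list(TC_YOSEMITE_MACHINES)
```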
Thanks Callek for the review.

I will fix the global issues with the list of machines.

I landed the existing code to see if it worked

The tc machines are included in the status
https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slavetype.html?class=test&type=t-yosemite-r7

I was able to reboot one of them
https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?class=test&type=t-yosemite-r7&name=t-yosemite-r7-0041

[cltbld@t-yosemite-r7-0041.test.releng.scl3.mozilla.com ~]$ uptime
 8:07  up 5 mins, 2 users, load averages: 1.96 1.78 0.93

Not sure why the last 10 jobs history is wrong, will investigate.

In any case, we can reboot machines and see pending counts for mac tc workers as part of the overall pending count so that's great.
Flags: needinfo?(bugspam.Callek)
The logic for getting the list of last 10 jobs is embedded within the page:
https://hg.mozilla.org/build/slave_health/file/tip/slave.html#l155

which ends up calling:
https://hg.mozilla.org/build/slave_health/file/tip/js/slave_health.js#l632

Perhaps there is a way to say: if the job history is empty, check whether this is a taskcluster machine that has data, and if so, build the table from that.
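The fallback described here reduces to a small amount of logic, sketched below in Python for illustration (the page itself is JS; both fetchers are placeholders supplied by the caller, not real slave_health functions):

```python
def job_history(machine, fetch_buildbot_jobs, fetch_tc_tasks):
    # Prefer the buildapi history; an empty result may mean the
    # machine has been migrated to taskcluster.
    jobs = fetch_buildbot_jobs(machine)
    if jobs:
        return jobs
    # Fall back to taskcluster task data, defaulting to an empty list.
    return fetch_tc_tasks(machine) or []
```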
Comment on attachment 8873932 [details] [diff] [review]
bug1367448puppet-2.patch

obsolete; we don't need the puppet patches anymore since we no longer need to talk to the dbs directly
Attachment #8873932 - Attachment is obsolete: true
Attachment #8874021 - Attachment is obsolete: true
Attachment #8874950 - Attachment is obsolete: true
from #tcmigration
kmoir: it looks like you're not getting the last 10 jobs output for TC machines because the JS to get that info is embedded in the page: https://hg.mozilla.org/build/slave_health/file/tip/slave.html#l155
Patch to make the list of tc machines more global.  Also, I will update this list as we migrate machines to tc this week.

Re the issue of the last 100 jobs for tc machines: the current code talks to buildapi, which doesn't have tc data, so I have to write some code to query Greg's API for this info.
Attachment #8876876 - Flags: checked-in+
Also, from irc
and there is no way to "disable" a machine other than hopping on it and killing generic-worker I think
Blocks: 1373289
No longer blocks: 1373289
Blocks: 1387051
Assignee: kmoir → nobody
Component: Platform Support → Buildduty
Product: Release Engineering → Infrastructure & Operations
I think this can be closed, as it's based on slave-health, which is dead.

In the new world, we rely on "Bug 1504331 - create a service that listens for events from hardware instances" and the moonshot-hardware spreadsheet.

I believe we are also adding nagios checks onto the workers themselves to sanity-check that they are communicating with the Queue. cc'ing fubar
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → INVALID
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard