Closed Bug 1095300 Opened 10 years ago Closed 9 years ago

slaveapi should be able to determine whether a slave is currently running a job

Categories

(Release Engineering :: General, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: philor, Assigned: coop)

Details

(Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/4111])

      No description provided.
VS2013 made things faster, then we combined js etc. into libxul, and now it's slower than when we started?
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/4111]
Per IRC, wontfix this bug (philor said not to change the current logic here, at least if said logic applies to all slaves)

[21:48:58]	Callek	philor: hitting up https://bugzilla.mozilla.org/show_bug.cgi?id=1095300 now, whats your recommended "time to wait before graceful" knowing of course that a failed graceful will *also* take that long to be noticed
[21:49:13]	Callek	as in, max(+6 hours, whatever I set this time to)
[21:49:45]	Callek	(since 6 hours is the slaveapi timeout for gracefuls it can't verify)
[21:50:18]	Callek	philor: is 6 hours, or 7 hours your recommended time, is what I'm basically asking
[21:50:28]	Callek	philor: also this time is the same across all slaves/jobs
[21:50:45]	philor	Callek: my recommendation is that we build a tool to track end-to-end times
[21:50:53]	philor	and that we make it not the same across all slaves
[21:51:06]	philor	that's... what's that nice word?... suboptimal
[21:51:10]	Callek	philor: basically I'm asking what I can do right now to make things better
[21:51:21]	Callek	since right now its 5 hours
[21:51:50]	Callek	(I have other priorities that conflict with this atm, which means I can only devote a few minutes today to it)
[21:52:40]	philor	Callek: so, we have slave pools where no job ever takes longer than an hour, and we leave them idle for 5 hours, and then another 6 if they don't graceful, and one slave pool where jobs take 5.5 hours, and we thus lose them for 11 hours every day because they rarely graceful?
[21:52:47]	philor	Callek: please do nothing
[21:53:33]	Callek	philor: well if the graceful comes back as failed, early (which sadly is not that common) then we will reboot earlier than the 6 hours
[21:53:41]	Callek	and if it is successful we reboot right away
[21:54:04]	Callek	but yea its the 5 hours from "last job" --> "checking" that we deal with right now
[21:53:53]	philor	I can manually deal with Win build slaves better than every single pool can deal with even more idle time
[21:54:18]	Callek	ok, then I'll mark that bug wontfix, thanks
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → WONTFIX
Couldn't we run different instances of slaverebooter at different cadences, each using a different config file to exclude slave types not on that cadence?
Thanks to not really having a clear picture of what happens when, I think I underestimated how widespread this is.

It has less to do with builds that take longer than slaverebooter's idea of idle time, and more to do with how frequently a slave is running a build of any duration while its previous build finished more than slaverebooter's idea of idle time ago.

What we do, every single day, is run out of demand as the US workday ends, build up a good percentage of the Windows build pool with idle times of 1 or 2 or 3 or 4 hours, and then as NZ and Japan and then Europe wake up and first-thing push the things they got review on, we set ourselves up for the quarter-till-1am slaverebooter run to happen while we have a lot of slaves running a build, and thus unable to immediately graceful, but with more than 5 hours since their last job because they started their current job with several hours of idle time.
(In reply to Phil Ringnalda (:philor) from comment #4)
> What we do, every single day, is run out of demand as the US workday ends,
> build up a good percentage of the Windows build pool with idle times of 1 or
> 2 or 3 or 4 hours, and then as NZ and Japan and then Europe wake up and
> first-thing push the things they got review on, we set ourselves up for the
> quarter-till-1am slaverebooter run to happen while we have a lot of slaves
> running a build, and thus unable to immediately graceful, but with more than
> 5 hours since their last job because they started their current job with
> several hours of idle time.

We should really have another check here: specifically, we should be able to determine quickly, via slaveapi, whether a given slave is currently running a job.

We could gather this data by hitting the buildbot web interface (slow), by tailing twistd.log on the slave, or by extending the buildbot slave code directly to provide a simple JSON endpoint.
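
For illustration, here is a minimal sketch of the "hit the buildbot web interface" option, assuming a buildbot 0.8-style master whose JSON web status exposes /json/slaves/<name> and lists in-progress builds under "runningBuilds" (the master URL and field names here are assumptions for the sketch, not verified against our masters):

    import requests

    MASTER_URL = "http://buildbot-master.example.com:8010"  # hypothetical master URL

    def is_slave_running_job(slave_name):
        # Query the master's JSON web status for this slave; any entry in
        # "runningBuilds" means a job is currently in progress on it.
        url = "%s/json/slaves/%s" % (MASTER_URL, slave_name)
        data = requests.get(url, timeout=30).json()
        return bool(data.get("runningBuilds"))

Slaveapi could cache the per-slave result so slaverebooter only acts on machines that are genuinely idle; if polling the master proves too slow, the log-tailing or slave-side endpoint options would avoid that bottleneck.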

This would let slaverebooter make smarter decisions, and would also provide useful information to people using slave health.
Assignee: bugspam.Callek → nobody
Status: RESOLVED → REOPENED
Resolution: WONTFIX → ---
Summary: Adjust slaverebooter's idea of how long jobs take, since we're doing Win PGO builds in more than five hours now → slaveapi should be able to determine whether a slave is currently running a job
Assignee: nobody → coop
There is still no non-invasive way to do this, and slaverebooter isn't used any more.
Status: REOPENED → RESOLVED
Closed: 10 years ago → 9 years ago
Resolution: --- → WONTFIX
Component: Tools → General