Closed Bug 1095300 Opened 10 years ago Closed 9 years ago

slaveapi should be able to determine whether a slave is currently running a job

Categories

(Release Engineering :: General, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: philor, Assigned: coop)

Details

(Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/4111])

      No description provided.
VS2013 made things faster, then we combined js etc. into libxul, and now it's slower than when we started?
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/4111]
Per IRC, wontfix this bug (philor said not to change the current logic here, at least if said logic applies to all slaves)

[21:48:58]	Callek	philor: hitting up https://bugzilla.mozilla.org/show_bug.cgi?id=1095300 now, whats your recommended "time to wait before graceful" knowing of course that a failed graceful will *also* take that long to be noticed
[21:49:13]	Callek	as in, max(+6 hours, whatever I set this time to)
[21:49:45]	Callek	(since 6 hours is the slaveapi timeout for gracefuls it can't verify)
[21:50:18]	Callek	philor: is 6 hours, or 7 hours your recommended time, is what I'm basically asking
[21:50:28]	Callek	philor: also this time is the same across all slaves/jobs
[21:50:45]	philor	Callek: my recommendation is that we build a tool to track end-to-end times
[21:50:53]	philor	and that we make it not the same across all slaves
[21:51:06]	philor	that's... what's that nice word?... suboptimal
[21:51:10]	Callek	philor: basically I'm asking what I can do right now to make things better
[21:51:21]	Callek	since right now its 5 hours
[21:51:50]	Callek	(I have other priorities that conflict with this atm, which means I can only devote a few minutes today to it)
[21:52:40]	philor	Callek: so, we have slave pools where no job ever takes longer than an hour, and we leave them idle for 5 hours, and then another 6 if they don't graceful, and one slave pool where jobs take 5.5 hours, and we thus lose them for 11 hours every day because they rarely graceful?
[21:52:47]	philor	Callek: please do nothing
[21:53:33]	Callek	philor: well if the graceful comes back as failed, early (which sadly is not that common) then we will reboot earlier than the 6 hours
[21:53:41]	Callek	and if it is successful we reboot right away
[21:54:04]	Callek	but yea its the 5 hours from "last job" --> "checking" that we deal with right now
[21:53:53]	philor	I can manually deal with Win build slaves better than every single pool can deal with even more idle time
[21:54:18]	Callek	ok, then I'll mark that bug wontfix, thanks
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → WONTFIX
Couldn't we run different instances of slaverebooter at different cadences, each using a different config file to exclude slave types not on that cadence?
Thanks to not really having a clear picture of what happens when, I think I underestimated how widespread this is.

It has less to do with builds that take longer than slaverebooter's idea of idle time, and more to do with how frequently a slave is running a build of any duration while its previous build finished more than slaverebooter's idea of idle time ago.

What we do, every single day, is run out of demand as the US workday ends, build up a good percentage of the Windows build pool with idle times of 1 or 2 or 3 or 4 hours, and then as NZ and Japan and then Europe wake up and first-thing push the things they got review on, we set ourselves up for the quarter-till-1am slaverebooter run to happen while we have a lot of slaves running a build, and thus unable to immediately graceful, but with more than 5 hours since their last job because they started their current job with several hours of idle time.
(In reply to Phil Ringnalda (:philor) from comment #4)
> What we do, every single day, is run out of demand as the US workday ends,
> build up a good percentage of the Windows build pool with idle times of 1 or
> 2 or 3 or 4 hours, and then as NZ and Japan and then Europe wake up and
> first-thing push the things they got review on, we set ourselves up for the
> quarter-till-1am slaverebooter run to happen while we have a lot of slaves
> running a build, and thus unable to immediately graceful, but with more than
> 5 hours since their last job because they started their current job with
> several hours of idle time.

We should really have another check here: specifically, we should be able to determine quickly, via slaveapi, whether a given slave is currently running a job.

We could gather this data by hitting the buildbot web interface (slow), by tailing twistd.log on the slave, or by extending the buildbot slave code directly to provide a simple JSON endpoint.
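
For illustration, here is a minimal sketch of the "hit the buildbot web interface" option, assuming a buildbot 0.8-style master whose JSON web status exposes /json/slaves/<name> and lists in-progress builds under "runningBuilds" (the master URL and field names here are assumptions for the sketch, not verified against our masters):

    import requests

    MASTER_URL = "http://buildbot-master.example.com:8010"  # hypothetical master URL

    def is_slave_running_job(slave_name):
        # Query the master's JSON web status for this slave; any entry in
        # "runningBuilds" means a job is currently in progress on it.
        url = "%s/json/slaves/%s" % (MASTER_URL, slave_name)
        data = requests.get(url, timeout=30).json()
        return bool(data.get("runningBuilds"))

Slaveapi could cache the per-slave result so slaverebooter only acts on machines that are genuinely idle; if polling the master proves too slow, the log-tailing or slave-side endpoint options would avoid that bottleneck.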

This would let slaverebooter make smarter decisions, and would also provide useful information to people using slave health.
Assignee: bugspam.Callek → nobody
Status: RESOLVED → REOPENED
Resolution: WONTFIX → ---
Summary: Adjust slaverebooter's idea of how long jobs take, since we're doing Win PGO builds in more than five hours now → slaveapi should be able to determine whether a slave is currently running a job
Assignee: nobody → coop
There is still no non-invasive way to do this, and slaverebooter isn't used any more.
Status: REOPENED → RESOLVED
Closed: 10 years ago → 9 years ago
Resolution: --- → WONTFIX
Component: Tools → General