Bug 1057888 (Closed): Opened 10 years ago, Closed 8 years ago

Automate monthly graceful restart of buildbot masters

Categories

(Release Engineering :: General, defect)

Platform: x86
OS: All
Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: nthomas, Assigned: coop)

References

Details

(Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/3401] )

Attachments

(4 files)

Bug 1056348 means we should automate a monthly restart of all the buildbot masters, to keep them from getting crufty and slow as their memory usage grows. Fabric can almost do this already; we could run it on, say, a Sunday afternoon when load is much reduced and I'm around to handle any issues. A set of calls using -j1 or -j2 for the different types of masters, run by cron.

We need a few things first:
* fabric doesn't handle scheduler masters, because it requires http_port to be defined in production-masters.json; for those masters it can skip straight ahead to the buildbot-wrangler call
* a graceful_restart should disable the master in slavealloc, so that slaves don't connect to it if the shutdown takes some time
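For the slavealloc piece, something along these lines could work. This is only a sketch: the base URL, endpoint path and JSON field are assumptions about the slavealloc REST API, not its verified interface.

# Sketch only: disable a master in slavealloc before the graceful shutdown,
# and re-enable it once the master is back up. The URL layout and the
# "enabled" field are assumptions, not the verified slavealloc API.
import requests

SLAVEALLOC_API = "https://secure.pub.build.mozilla.org/slavealloc/api"  # assumed base URL

def set_master_enabled(master_id, enabled, auth):
    url = "%s/masters/%d" % (SLAVEALLOC_API, master_id)
    resp = requests.put(url, json={"enabled": enabled}, auth=auth)
    resp.raise_for_status()

# e.g. set_master_enabled(123, False, ("ldap_user", "ldap_password")) before
# the buildbot-wrangler call, then set_master_enabled(123, True, ...) after.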
Depends on: 1057889
This lets us do things like this in parallel:

python manage_masters.py -f production-masters.json -R build -D aws-us-east-1 -j2 graceful_restart
python manage_masters.py -f production-masters.json -R build -D aws-us-west-2 -j2 graceful_restart
python manage_masters.py -f production-masters.json -R build -D scl3 -j2 graceful_restart
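
A cron-driven wrapper could fan those out concurrently and wait for all of them to finish. A minimal sketch (not the tooling that landed; the datacentre list and paths are illustrative):

# Sketch: run one graceful_restart per datacentre in parallel, e.g. from cron.
import subprocess

DATACENTRES = ["aws-us-east-1", "aws-us-west-2", "scl3"]

procs = []
for dc in DATACENTRES:
    cmd = ["python", "manage_masters.py",
           "-f", "production-masters.json",
           "-R", "build", "-D", dc, "-j2", "graceful_restart"]
    procs.append((dc, subprocess.Popen(cmd)))

for dc, proc in procs:
    if proc.wait() != 0:
        print("graceful_restart for %s failed" % dc)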

Still need the nice stuff in bug 1057889 for slavealloc enable/disable etc.
Attachment #8515688 - Flags: review?(coop)
s/checkout your options/check your options/.
Attached file Notes on restart 2/Nov
Gracefully restarted the masters today; they'll need doing again by Dec 4th.

I broke it up into multiple screen windows (see attachment) but it still took about 8 hours, with the long poles being the -R build masters due to their multi-hour jobs. Something like 3 of the AWS linux test masters got stuck between the last job finishing and actually shutting down, so I manually did a 'kill <buildbot pid>' to unstick them.
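
The manual unstick step amounts to reading the pid out of the master's twistd.pid and signalling it; roughly (the basedir path is assumed):

# Sketch: unstick a master that finished its last job but never exited.
# The basedir is an assumption; buildbot's twistd writes its pid to
# <basedir>/twistd.pid.
import os
import signal

def kill_stuck_master(basedir):
    pidfile = os.path.join(basedir, "twistd.pid")
    with open(pidfile) as f:
        pid = int(f.read().strip())
    os.kill(pid, signal.SIGTERM)  # escalate to SIGKILL if it still hangs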
Comment on attachment 8515688 [details] [diff] [review]
[tools] Add --datacentre argument to manage_masters

Review of attachment 8515688 [details] [diff] [review]:
-----------------------------------------------------------------

A good start.
Attachment #8515688 - Flags: review?(coop) → review+
Comment on attachment 8515688 [details] [diff] [review]
[tools] Add --datacentre argument to manage_masters

https://hg.mozilla.org/build/tools/rev/323e342f5ab3
Attachment #8515688 - Flags: checked-in+
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/3394]
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/3394] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/3399]
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/3399] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/3401]
I did a rolling restart today, but it occurred to me that TCWs might be a good time to do this every 6 weeks, perhaps without the graceful part of the restart if the jobs will retry afterwards.
The idea is to do the master restarts in the tree closing windows, and remind us by squawking a couple of days beforehand. I won't claim this is a great plan, since we won't always have a TCW, but the current 31-day interval was chosen fairly arbitrarily and nagios has been very noisy today.

We could maybe switch to checking the memory size of the buildbot process instead, if we account for different workloads (e.g. try vs tests-linux64) and the configured memory.
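
A check along those lines might look roughly like this; the thresholds, workload names, and pidfile handling are invented for illustration:

# Sketch: flag a master for restart when its RSS exceeds a per-workload limit,
# rather than alerting purely on uptime. Limits here are made up.
import psutil

RSS_LIMITS = {            # hypothetical per-workload RSS limits, in bytes
    "try": 6 * 1024 ** 3,
    "tests-linux64": 4 * 1024 ** 3,
    "build": 5 * 1024 ** 3,
}

def master_needs_restart(pidfile, workload):
    with open(pidfile) as f:
        pid = int(f.read().strip())
    rss = psutil.Process(pid).memory_info().rss
    return rss > RSS_LIMITS.get(workload, 4 * 1024 ** 3)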
Attachment #8646290 - Flags: review?(coop)
Attachment #8646290 - Flags: feedback?(hwine)
Comment on attachment 8646290 [details] [diff] [review]
[puppet] Start warning after 5wks + 5 days, instead of 31 days

Review of attachment 8646290 [details] [diff] [review]:
-----------------------------------------------------------------

I'm fine with a longer timeout in general, but we really should figure out how to run this automatically. A six week cycle would be fine if it didn't always seem to match up with a release week too.
Attachment #8646290 - Flags: review?(coop) → review+
Per discussion with coop on IRC, I've also changed the check and alert intervals to 12 hours instead of every hour.

svn revision 106923.
Comment on attachment 8646290 [details] [diff] [review]
[puppet] Start warning after 5wks + 5 days, instead of 31 days

lgtm.

1) I think our release schedule will be fairly predictable, so shifting out of sync with release week is unlikely. However, is there any place we need to flag this for chemspill? (Or do we need a global "are we in chemspill mode" service that services and groups can query?)

2) We have 2 types of TCWs nowadays: traditional & "soft close". I agree with :coop that automation is most reliable. And I wonder if this is something we should also manually do during a traditional TCW. (roughly: since we're getting the oil changed, let's check the tire pressures.)
Attachment #8646290 - Flags: feedback?(hwine) → feedback+
See Also: → 1220296
Assignee: nobody → coop
Status: NEW → ASSIGNED
This is the script I've been using since this time last year (Dec 2014) to restart the masters during TCWs.
Attachment #8699514 - Flags: review?(kmoir)
Attachment #8699514 - Flags: review?(kmoir) → review+
Comment on attachment 8699514 [details] [diff] [review]
[tools] Add restart_masters.py script

https://hg.mozilla.org/build/tools/rev/77c3a3f50b3f
Attachment #8699514 - Flags: checked-in+
I currently have this set up on dev-master2. There's a disabled cron job for it on my personal account.

There are a couple of things I'd like to fix in this script before we run it automatically:

1) Remove the need for a config file - we use LDAP creds to look up master IDs and change the master enabled state. Is there another way we could do this?

2) Respect reconfig.lock files for reconfigs already in progress.
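
For (2), the check could be as simple as polling for the lock file before touching the master; a sketch (lock path and wait policy assumed):

# Sketch: wait out a reconfig already in progress, signalled by a
# reconfig.lock file in the master directory, before restarting it.
import os
import time

def wait_for_reconfig(master_dir, timeout=3600, poll=60):
    # Return True once no reconfig.lock is present, False if we time out.
    lockfile = os.path.join(master_dir, "reconfig.lock")
    deadline = time.time() + timeout
    while os.path.exists(lockfile):
        if time.time() > deadline:
            return False
        time.sleep(poll)
    return True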
(In reply to Chris Cooper [:coop] from comment #14)
> 1) Remove need for config file - we use LDAP creds to lookup master IDs and
> change master enabled state. Is there another way we could do this?

Use https://pypi.python.org/pypi/keyring for the LDAP creds?
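
For example (service and account names here are placeholders, not what the script uses):

# Sketch: pull the LDAP password from the OS keyring instead of a config file.
import getpass
import keyring

SERVICE = "slavealloc"             # placeholder service name
ldap_user = "someone@mozilla.com"  # placeholder account

password = keyring.get_password(SERVICE, ldap_user)
if password is None:
    # first run: prompt once and store it in the keyring
    password = getpass.getpass("LDAP password for %s: " % ldap_user)
    keyring.set_password(SERVICE, ldap_user, password)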
:coop is this bug FIXED (since we do this weekly now?)
Flags: needinfo?(coop)
Blocks: 1248257
(In reply to Justin Wood (:Callek) from comment #16)
> :coop is this bug FIXED (since we do this weekly now?)

There are potential improvements to the process, to be sure, but they can be follow-ups. Here's a short list for my future bug filing:

* remove the need for as many credentials as possible, e.g. could we keep the buildbot master IDs in the production-masters.json file rather than needing to do a slavealloc API lookup, or is there an internal mirror of the API that doesn't require credentials?
* fix the default logging, which is very chatty right now. In general, we only want to know about the problem cases now that many of the kinks are worked out.
* put a cap on how long we wait for a master to restart, and then do a hard stop/start. 5 hours is probably a good max time. (See the sketch after this list.)
* add papertrail alerts for problem master events
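
A sketch of what that wait cap could look like; the helper names (graceful_stop, is_stopped, hard_stop, start_master) are hypothetical stand-ins for whatever restart_masters.py actually calls:

# Sketch: cap how long we wait for a graceful stop, then fall back to a hard
# stop/start. Helper callables are hypothetical placeholders.
import time

MAX_GRACEFUL_WAIT = 5 * 3600  # 5 hours, per the note above

def restart_with_cap(master, graceful_stop, is_stopped, hard_stop, start_master):
    graceful_stop(master)
    deadline = time.time() + MAX_GRACEFUL_WAIT
    while not is_stopped(master):
        if time.time() > deadline:
            hard_stop(master)  # give up on graceful and force it down
            break
        time.sleep(300)
    start_master(master)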

I'll write a HowTo doc about it today and make sure the buildduty team is aware.
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Flags: needinfo?(coop)
Resolution: --- → FIXED
Blocks: 1249356
https://hg.mozilla.org/build/tools/rev/bb51b858180f5a17430bb079da8d4e5f5b4f8bb8
Bug 1057888 - add ability to pull credentials from file, add logging for masters that hit problems during restart - r=bustage
We hit some restart failures on bm05, bm06, bm51, bm52, bm53 this weekend. It looks like the masters shut down, but there was some kind of race and the script was still trying to gracefully stop them.
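
One way to avoid that race would be to check that the master process is still alive before issuing the graceful stop; roughly:

# Sketch: treat a master whose pid is already gone as stopped, instead of
# racing it with another graceful stop.
import errno
import os

def master_is_running(pidfile):
    try:
        with open(pidfile) as f:
            pid = int(f.read().strip())
    except (IOError, ValueError):
        return False  # no pidfile (or garbage in it): treat as stopped
    try:
        os.kill(pid, 0)  # signal 0 only tests for existence
    except OSError as e:
        return e.errno == errno.EPERM  # alive but owned by another user
    return True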
Depends on: 1253874
See Also: → 1256118
Blocks: 1275428
Component: Tools → General