Bug 1057888 (Closed): Opened 10 years ago, Closed 8 years ago

Automate monthly graceful restart of buildbot masters

Categories

(Release Engineering :: General, defect)

Platform: x86
OS: All
Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: nthomas, Assigned: coop)

References

Details

(Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/3401] )

Attachments

(4 files)

Bug 1056348 means we should automate a monthly restart of all the buildbot masters, to keep them from getting crufty and slow as their memory usage grows. Fabric can almost do this already; we could run it on, say, a Sunday afternoon when load is much reduced and I'm around to handle any issues. A set of calls using -j1 or -j2 for the different types of masters, run by cron.

We need a few things first:
* fabric doesn't handle scheduler masters, because it requires http_port to be defined in production-masters.json; for those masters it can skip straight ahead to the buildbot-wrangler call
* a graceful_restart should disable the master in slavealloc, so that slaves don't connect to it if the shutdown takes some time
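For the slavealloc piece, something along these lines could work. This is only a sketch: the base URL, endpoint path and JSON field are assumptions about the slavealloc REST API, not its verified interface.

# Sketch only: disable a master in slavealloc before the graceful shutdown,
# and re-enable it once the master is back up. The URL layout and the
# "enabled" field are assumptions, not the verified slavealloc API.
import requests

SLAVEALLOC_API = "https://secure.pub.build.mozilla.org/slavealloc/api"  # assumed base URL

def set_master_enabled(master_id, enabled, auth):
    url = "%s/masters/%d" % (SLAVEALLOC_API, master_id)
    resp = requests.put(url, json={"enabled": enabled}, auth=auth)
    resp.raise_for_status()

# e.g. set_master_enabled(123, False, ("ldap_user", "ldap_password")) before
# the buildbot-wrangler call, then set_master_enabled(123, True, ...) after.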
Depends on: 1057889
This lets us do things like this in parallel:

python manage_masters.py -f production-masters.json -R build -D aws-us-east-1 -j2 graceful_restart
python manage_masters.py -f production-masters.json -R build -D aws-us-west-2 -j2 graceful_restart
python manage_masters.py -f production-masters.json -R build -D scl3 -j2 graceful_restart
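
A cron-driven wrapper could fan those out concurrently and wait for all of them to finish. A minimal sketch (not the tooling that landed; the datacentre list and paths are illustrative):

# Sketch: run one graceful_restart per datacentre in parallel, e.g. from cron.
import subprocess

DATACENTRES = ["aws-us-east-1", "aws-us-west-2", "scl3"]

procs = []
for dc in DATACENTRES:
    cmd = ["python", "manage_masters.py",
           "-f", "production-masters.json",
           "-R", "build", "-D", dc, "-j2", "graceful_restart"]
    procs.append((dc, subprocess.Popen(cmd)))

for dc, proc in procs:
    if proc.wait() != 0:
        print("graceful_restart for %s failed" % dc)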

Still need the nice stuff in bug 1057889 for slavealloc enable/disable etc.
Attachment #8515688 - Flags: review?(coop)
s/checkout your options/check your options/.
Attached file Notes on restart 2/Nov
Gracefully restarted the masters today; they'll need doing again by Dec 4th.

I broke it up into multiple screen windows (see attachment) but it still took about 8 hours, with the long poles being the -R build masters due to their multi-hour jobs. Something like 3 of the AWS linux test masters got stuck between the last job finishing and actually shutting down, so I manually did a 'kill <buildbot pid>' to unstick them.
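
The manual unstick step amounts to reading the pid out of the master's twistd.pid and signalling it; roughly (the basedir path is assumed):

# Sketch: unstick a master that finished its last job but never exited.
# The basedir is an assumption; buildbot's twistd writes its pid to
# <basedir>/twistd.pid.
import os
import signal

def kill_stuck_master(basedir):
    pidfile = os.path.join(basedir, "twistd.pid")
    with open(pidfile) as f:
        pid = int(f.read().strip())
    os.kill(pid, signal.SIGTERM)  # escalate to SIGKILL if it still hangs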
Comment on attachment 8515688 [details] [diff] [review]
[tools] Add --datacentre argument to manage_masters

Review of attachment 8515688 [details] [diff] [review]:
-----------------------------------------------------------------

A good start.
Attachment #8515688 - Flags: review?(coop) → review+
Comment on attachment 8515688 [details] [diff] [review]
[tools] Add --datacentre argument to manage_masters

https://hg.mozilla.org/build/tools/rev/323e342f5ab3
Attachment #8515688 - Flags: checked-in+
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/3394]
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/3394] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/3399]
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/3399] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/3401]
I did a rolling restart today, but it occurred to me that TCWs might be a good time to do this every 6 weeks, perhaps without the graceful part of the restart if the jobs will retry afterwards.
The idea is to do the master restarts in the tree closing windows, and remind us by squawking a couple of days beforehand. I won't claim this is a great plan, since we won't always have a TCW, but the current 31-day interval was chosen fairly arbitrarily and nagios has been very noisy today.

We could maybe switch to checking the memory size of the buildbot process instead, if we account for different workloads (e.g. try vs tests-linux64) and the configured memory.
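
A check along those lines might look roughly like this; the thresholds, workload names, and pidfile handling are invented for illustration:

# Sketch: flag a master for restart when its RSS exceeds a per-workload limit,
# rather than alerting purely on uptime. Limits here are made up.
import psutil

RSS_LIMITS = {            # hypothetical per-workload RSS limits, in bytes
    "try": 6 * 1024 ** 3,
    "tests-linux64": 4 * 1024 ** 3,
    "build": 5 * 1024 ** 3,
}

def master_needs_restart(pidfile, workload):
    with open(pidfile) as f:
        pid = int(f.read().strip())
    rss = psutil.Process(pid).memory_info().rss
    return rss > RSS_LIMITS.get(workload, 4 * 1024 ** 3)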
Attachment #8646290 - Flags: review?(coop)
Attachment #8646290 - Flags: feedback?(hwine)
Comment on attachment 8646290 [details] [diff] [review]
[puppet] Start warning after 5wks + 5 days, instead of 31 days

Review of attachment 8646290 [details] [diff] [review]:
-----------------------------------------------------------------

I'm fine with a longer timeout in general, but we really should figure out how to run this automatically. A six week cycle would be fine if it didn't always seem to match up with a release week too.
Attachment #8646290 - Flags: review?(coop) → review+
Per discussion with coop on IRC, I've also changed the check and alert intervals to 12 hours instead of every hour.

svn revision 106923.
Comment on attachment 8646290 [details] [diff] [review]
[puppet] Start warning after 5wks + 5 days, instead of 31 days

lgtm.

1) I think our release schedule will be fairly predictable, so shifting out of sync with release week is unlikely. However, is there any place we need to flag this for chemspill? (Or do we need a global "are we in chemspill mode" service that services and groups can query?)

2) We have 2 types of TCWs nowadays: traditional & "soft close". I agree with :coop that automation is most reliable. And I wonder if this is something we should also manually do during a traditional TCW. (roughly: since we're getting the oil changed, let's check the tire pressures.)
Attachment #8646290 - Flags: feedback?(hwine) → feedback+
See Also: → 1220296
Assignee: nobody → coop
Status: NEW → ASSIGNED
This is the script I've been using since this time last year (Dec 2014) to restart the masters during TCWs.
Attachment #8699514 - Flags: review?(kmoir)
Attachment #8699514 - Flags: review?(kmoir) → review+
Comment on attachment 8699514 [details] [diff] [review]
[tools] Add restart_masters.py script

https://hg.mozilla.org/build/tools/rev/77c3a3f50b3f
Attachment #8699514 - Flags: checked-in+
I currently have this set up on dev-master2. There's a disabled cron job for it on my personal account.

There are a couple of things I'd like to fix in this script before we run it automatically:

1) Remove the need for a config file - we use LDAP creds to look up master IDs and change the master enabled state. Is there another way we could do this?

2) Respect reconfig.lock files for reconfigs already in progress.
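
For (2), the check could be as simple as polling for the lock file before touching the master; a sketch (lock path and wait policy assumed):

# Sketch: wait out a reconfig already in progress, signalled by a
# reconfig.lock file in the master directory, before restarting it.
import os
import time

def wait_for_reconfig(master_dir, timeout=3600, poll=60):
    # Return True once no reconfig.lock is present, False if we time out.
    lockfile = os.path.join(master_dir, "reconfig.lock")
    deadline = time.time() + timeout
    while os.path.exists(lockfile):
        if time.time() > deadline:
            return False
        time.sleep(poll)
    return True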
(In reply to Chris Cooper [:coop] from comment #14)
> 1) Remove need for config file - we use LDAP creds to lookup master IDs and
> change master enabled state. Is there another way we could do this?

Use https://pypi.python.org/pypi/keyring for the LDAP creds?
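
For example (service and account names here are placeholders, not what the script uses):

# Sketch: pull the LDAP password from the OS keyring instead of a config file.
import getpass
import keyring

SERVICE = "slavealloc"             # placeholder service name
ldap_user = "someone@mozilla.com"  # placeholder account

password = keyring.get_password(SERVICE, ldap_user)
if password is None:
    # first run: prompt once and store it in the keyring
    password = getpass.getpass("LDAP password for %s: " % ldap_user)
    keyring.set_password(SERVICE, ldap_user, password)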
:coop is this bug FIXED (since we do this weekly now?)
Flags: needinfo?(coop)
Blocks: 1248257
(In reply to Justin Wood (:Callek) from comment #16)
> :coop is this bug FIXED (since we do this weekly now?)

There are potential improvements to the process, to be sure, but they can be follow-ups. Here's a short list for my future bug filing:

* remove the need for as many credentials as possible, e.g. could we keep the buildbot master IDs in the production-masters.json file rather than needing to do a slavealloc API lookup, or is there an internal mirror of the API that doesn't require credentials?
* fix the default logging, which is very chatty right now. In general, we only want to know about the problem cases now that many of the kinks are worked out.
* put a cap on how long we wait for a master to restart, and then do a hard stop/start. 5 hours is probably a good max time. (See the sketch after this list.)
* add papertrail alerts for problem master events
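
A sketch of what that wait cap could look like; the helper names (graceful_stop, is_stopped, hard_stop, start_master) are hypothetical stand-ins for whatever restart_masters.py actually calls:

# Sketch: cap how long we wait for a graceful stop, then fall back to a hard
# stop/start. Helper callables are hypothetical placeholders.
import time

MAX_GRACEFUL_WAIT = 5 * 3600  # 5 hours, per the note above

def restart_with_cap(master, graceful_stop, is_stopped, hard_stop, start_master):
    graceful_stop(master)
    deadline = time.time() + MAX_GRACEFUL_WAIT
    while not is_stopped(master):
        if time.time() > deadline:
            hard_stop(master)  # give up on graceful and force it down
            break
        time.sleep(300)
    start_master(master)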

I'll write a HowTo doc about it today and make sure the buildduty team is aware.
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Flags: needinfo?(coop)
Resolution: --- → FIXED
Blocks: 1249356
https://hg.mozilla.org/build/tools/rev/bb51b858180f5a17430bb079da8d4e5f5b4f8bb8
Bug 1057888 - add ability to pull credentials from file, add logging for masters that hit problems during restart - r=bustage
We hit some restart failures on bm05, bm06, bm51, bm52, bm53 this weekend. It looks like the masters shut down, but there was some kind of race and the script was still trying to gracefully stop them.
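
One way to avoid that race would be to check that the master process is still alive before issuing the graceful stop; roughly:

# Sketch: treat a master whose pid is already gone as stopped, instead of
# racing it with another graceful stop.
import errno
import os

def master_is_running(pidfile):
    try:
        with open(pidfile) as f:
            pid = int(f.read().strip())
    except (IOError, ValueError):
        return False  # no pidfile (or garbage in it): treat as stopped
    try:
        os.kill(pid, 0)  # signal 0 only tests for existence
    except OSError as e:
        return e.errno == errno.EPERM  # alive but owned by another user
    return True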
Depends on: 1253874
See Also: → 1256118
Blocks: 1275428
Component: Tools → General