Closed Bug 1057888 Opened 10 years ago Closed 8 years ago

Automate monthly graceful restart of buildbot masters

Categories: Release Engineering :: General, defect
Tracking: Not tracked
Status: RESOLVED FIXED
People: Reporter: nthomas; Assigned: coop
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/3401]
Attachments (4 files)

* patch, 2.86 KB (coop: review+; nthomas: checked-in+)
* text/plain, 2.01 KB
* patch, 867 bytes (coop: review+; hwine: feedback+; nthomas: checked-in+)
* patch, 14.69 KB (kmoir: review+; coop: checked-in+)
Bug 1056348 means we should automate a monthly restart of all the buildbot masters. This is to prevent them getting all crufty and slow from using a lot of memory. Fabric can almost do this already, say on a Sunday afternoon when load is much reduced and I'm around to handle any issues: a set of calls using -j1 or -j2 for the different types of masters, run by cron. We need a few things first:

* fabric doesn't handle scheduler masters, because it requires http_port to be defined in production-masters.json. In this case it can continue straight on to the buildbot-wrangler call
* a graceful_restart should disable the master in slavealloc, so that slaves don't connect to it if the shutdown takes some time
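The slavealloc piece could look something like this minimal Python sketch. The base URL, endpoint path, and payload shape are assumptions for illustration; the real slavealloc API sits behind credentials and isn't shown in this bug.

```python
import json
import urllib.request

# Hypothetical slavealloc base URL; the real endpoint and payload shape
# may differ and require LDAP credentials.
SLAVEALLOC = "https://slavealloc.example.com/api"

def master_update(masterid, enabled):
    # Build the URL and JSON body that flip a master's enabled flag.
    url = "%s/masters/%d" % (SLAVEALLOC, masterid)
    body = json.dumps({"enabled": enabled})
    return url, body

def set_master_enabled(masterid, enabled):
    url, body = master_update(masterid, enabled)
    req = urllib.request.Request(url, data=body.encode(), method="PUT",
                                 headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req)

def graceful_restart(master, wrangle):
    # Disable first so slaves stop connecting while the shutdown drains,
    # then re-enable once the buildbot-wrangler step (wrangle) is done.
    set_master_enabled(master["masterid"], False)
    try:
        wrangle(master)
    finally:
        set_master_enabled(master["masterid"], True)
```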
Comment 1 (Reporter) • 10 years ago
This lets us do things like this in parallel:

python manage_masters.py -f production-masters.json -R build -D aws-us-east-1 -j2 graceful_restart
python manage_masters.py -f production-masters.json -R build -D aws-us-west-2 -j2 graceful_restart
python manage_masters.py -f production-masters.json -R build -D scl3 -j2 graceful_restart

Still need the nice stuff in bug 1057889 for slavealloc enable/disable etc.
Attachment #8515688 - Flags: review?(coop)
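Those three invocations could be driven concurrently from a single wrapper run by cron. A sketch, assuming manage_masters.py and production-masters.json are in the working directory:

```python
import subprocess

DATACENTRES = ["aws-us-east-1", "aws-us-west-2", "scl3"]

def dc_command(dc, role="build", jobs=2):
    # Build one manage_masters.py invocation per datacentre.
    return ["python", "manage_masters.py", "-f", "production-masters.json",
            "-R", role, "-D", dc, "-j%d" % jobs, "graceful_restart"]

def restart_in_parallel():
    # Launch all three datacentres at once and wait for each to finish.
    procs = [subprocess.Popen(dc_command(dc)) for dc in DATACENTRES]
    return [p.wait() for p in procs]
```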
Comment 2 (Reporter) • 10 years ago
s/checkout your options/check your options/.
Comment 3 (Reporter) • 10 years ago
Gracefully restarted the masters today; they'll need doing again by Dec 4th. I broke it up into multiple screen windows (see attachment) but it still took about 8 hours, with the long poles being the -R build masters due to multi-hour jobs. Something like 3 of the AWS linux test masters got stuck between the last job finishing and actually doing a shutdown, so I manually did a 'kill <buildbot pid>' to unstick them.
Comment 4 (Assignee) • 10 years ago
Comment on attachment 8515688 [details] [diff] [review]
[tools] Add --datacentre argument to manage_masters

Review of attachment 8515688 [details] [diff] [review]:
-----------------------------------------------------------------

A good start.
Attachment #8515688 - Flags: review?(coop) → review+
Comment 5 (Reporter) • 10 years ago
Comment on attachment 8515688 [details] [diff] [review]
[tools] Add --datacentre argument to manage_masters

https://hg.mozilla.org/build/tools/rev/323e342f5ab3
Attachment #8515688 - Flags: checked-in+
Updated • 10 years ago
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/3394]
Updated • 10 years ago
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/3394] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/3399]
Updated • 10 years ago
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/3399] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/3401]
Comment 6 (Reporter) • 9 years ago
I did a rolling restart today, but it occurred to me that TCWs might be a good time to do this every 6 weeks. Perhaps without the graceful part of the restart, if the jobs will retry afterwards.
Comment 7 (Reporter) • 9 years ago
The idea is to do the master restarts in the tree closing windows, and remind us by squawking a couple of days before. I won't claim this is a great plan, since we won't always have a TCW, but the current 31 days was chosen fairly arbitrarily and nagios has been very noisy today. We could maybe switch to checking the memory size of the buildbot process instead, if we account for different workloads (e.g. try vs tests-linux64) and configured memory.
Attachment #8646290 - Flags: review?(coop)
Attachment #8646290 - Flags: feedback?(hwine)
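The memory-size idea from comment 7 could be checked on Linux by reading VmRSS out of /proc. A sketch; the per-workload limits here are made-up placeholders, not real thresholds:

```python
# Per-role RSS limits in KiB; the values are illustrative placeholders.
RSS_LIMITS_KIB = {"try": 4 * 1024 * 1024, "tests-linux64": 2 * 1024 * 1024}

def rss_kib(pid):
    # VmRSS in /proc/<pid>/status is reported in kB (Linux only).
    with open("/proc/%d/status" % pid) as fh:
        for line in fh:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])
    return None

def master_needs_restart(pid, role):
    # Flag a master whose buildbot process has outgrown its role's limit.
    limit = RSS_LIMITS_KIB.get(role)
    rss = rss_kib(pid)
    return limit is not None and rss is not None and rss > limit
```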
Comment 8 (Assignee) • 9 years ago
Comment on attachment 8646290 [details] [diff] [review]
[puppet] Start warning after 5wks + 5 days, instead of 31 days

Review of attachment 8646290 [details] [diff] [review]:
-----------------------------------------------------------------

I'm fine with a longer timeout in general, but we really should figure out how to run this automatically. A six week cycle would be fine if it didn't always seem to match up with a release week too.
Attachment #8646290 - Flags: review?(coop) → review+
Comment 9 • 9 years ago
Per discussion with coop in IRC, I've also changed the check and alert intervals to 12 hours instead of every hour. svn revision 106923.
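For reference, the interval change in a Nagios service definition might look like the following; the object names are placeholders, and this assumes the default interval_length of 60 seconds, where 720 units is 12 hours:

```
define service{
    use                     generic-service
    host_name               buildbot-master01       ; placeholder host
    service_description     buildbot master restart age
    check_interval          720   ; every 12 hours instead of every hour
    notification_interval   720   ; re-alert every 12 hours
}
```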
Comment 10 • 9 years ago
Comment on attachment 8646290 [details] [diff] [review]
[puppet] Start warning after 5wks + 5 days, instead of 31 days

lgtm.

1) I think our release schedule will be fairly predictable, so shifting out of sync with release week is unlikely. However, is there any place we need to flag this for chemspill? (Or do we need a global "are we in chemspill mode" service that services and groups can query?)

2) We have 2 types of TCWs nowadays: traditional & "soft close". I agree with :coop that automation is most reliable. And I wonder if this is something we should also manually do during a traditional TCW. (roughly: since we're getting the oil changed, let's check the tire pressures.)
Updated • 9 years ago
Attachment #8646290 - Flags: feedback?(hwine) → feedback+
Comment 11 (Reporter) • 9 years ago
Comment on attachment 8646290 [details] [diff] [review]
[puppet] Start warning after 5wks + 5 days, instead of 31 days

https://hg.mozilla.org/build/puppet/rev/c943e66eeb25
https://hg.mozilla.org/build/puppet/rev/95b053a7d355
Attachment #8646290 - Flags: checked-in+
Updated • 9 years ago
Assignee: nobody → coop
Status: NEW → ASSIGNED
Comment 12 (Assignee) • 9 years ago
This is the script I've been using since this time last year (Dec 2014) to restart the masters during TCWs.
Attachment #8699514 - Flags: review?(kmoir)
Updated • 9 years ago
Attachment #8699514 - Flags: review?(kmoir) → review+
Comment 13 (Assignee) • 9 years ago
Comment on attachment 8699514 [details] [diff] [review]
[tools] Add restart_masters.py script

https://hg.mozilla.org/build/tools/rev/77c3a3f50b3f
Attachment #8699514 - Flags: checked-in+
Comment 14 (Assignee) • 9 years ago
I currently have this set up on dev-master2. There's a disabled cron job for it on my personal account. There are a couple of things I'd like to fix in this script before we run it automatically:

1) Remove the need for a config file - we use LDAP creds to look up master IDs and change master enabled state. Is there another way we could do this?
2) Respect reconfig.lock files for reconfigs already in progress.
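Respecting an in-progress reconfig could be as simple as polling for the lock before touching the master. A sketch, assuming the lock is a plain file named reconfig.lock in the master directory:

```python
import os
import time

RECONFIG_LOCK = "reconfig.lock"  # assumed file name, per comment 14

def wait_for_reconfig(master_dir, timeout=3600, poll=30):
    # Block until any in-progress reconfig removes its lock file,
    # giving up after `timeout` seconds.
    lock = os.path.join(master_dir, RECONFIG_LOCK)
    deadline = time.time() + timeout
    while os.path.exists(lock):
        if time.time() > deadline:
            raise RuntimeError("reconfig still running in %s" % master_dir)
        time.sleep(poll)
```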
Comment 15 • 8 years ago
(In reply to Chris Cooper [:coop] from comment #14)
> 1) Remove need for config file - we use LDAP creds to lookup master IDs and
> change master enabled state. Is there another way we could do this?

Use https://pypi.python.org/pypi/keyring for the LDAP creds?
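A sketch of that suggestion, with an environment-variable fallback so the script still works on hosts without a usable keyring backend; the service and user names are placeholders:

```python
import os

def get_ldap_password(service="slavealloc", user="restart-bot"):
    # Prefer the OS keychain via the third-party `keyring` package;
    # fall back to an environment variable if it is missing or empty.
    try:
        import keyring
        secret = keyring.get_password(service, user)
        if secret:
            return secret
    except Exception:
        # ImportError, or no usable keyring backend on this host.
        pass
    return os.environ.get("LDAP_PASSWORD")
```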
Comment 16 • 8 years ago
:coop is this bug FIXED (since we do this weekly now?)
Flags: needinfo?(coop)
Comment 17 (Assignee) • 8 years ago
(In reply to Justin Wood (:Callek) from comment #16)
> :coop is this bug FIXED (since we do this weekly now?)

There are potential improvements to the process to be sure, but they can be follow-ups. Here's a short list for my future bug filing:

* remove the need for as many credentials as possible, e.g. could we keep the buildbot master IDs in the production-masters.json file rather than needing to do a slavealloc API lookup, or is there an internal mirror of the API that doesn't require credentials?
* fix the default logging, which is very chatty right now. In general, we only want to know about the problem cases now that many of the kinks are worked out.
* put a cap on how long we wait for a master to restart, and then do a hard stop/start. 5 hours is probably a good max time.
* add papertrail alerts for problem master events

I'll write a HowTo doc about it today and make sure the buildduty team is aware.
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Flags: needinfo?(coop)
Resolution: --- → FIXED
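The wait-cap idea in comment 17 could be sketched like this; `master` is a hypothetical object exposing graceful_stop/hard_stop/start/is_running, standing in for whatever restart_masters.py actually drives:

```python
import time

def restart_with_cap(master, max_wait=5 * 3600, poll=60):
    # Graceful stop first; if the master is still draining jobs when the
    # cap expires, fall back to a hard stop so the restart always finishes.
    master.graceful_stop()
    deadline = time.time() + max_wait
    while master.is_running():
        if time.time() > deadline:
            master.hard_stop()
            break
        time.sleep(poll)
    master.start()
```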
Comment 18 (Assignee) • 8 years ago
https://hg.mozilla.org/build/tools/rev/bb51b858180f5a17430bb079da8d4e5f5b4f8bb8
Bug 1057888 - add ability to pull credentials from file, add logging for masters that hit problems during restart - r=bustage
Comment 19 (Reporter) • 8 years ago
We hit some restart failures on bm05, bm06, bm51, bm52, and bm53 this weekend. It looks like the masters shut down, but there was some kind of race and the script was still trying to gracefully stop them.
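One way to avoid that race is to re-check that the buildbot process still exists immediately before issuing the graceful stop. A POSIX sketch, with `stop` standing in for whatever actually performs the stop:

```python
import errno
import os

def buildbot_alive(pid):
    # Signal 0 probes for the process without actually signalling it.
    try:
        os.kill(pid, 0)
    except OSError as e:
        if e.errno == errno.ESRCH:
            return False   # no such process: it already exited
        raise              # e.g. EPERM: exists, but owned by someone else
    return True

def graceful_stop_if_running(pid, stop):
    # Re-check liveness right before the stop, so a master that exited
    # between polls is not gracefully "stopped" a second time.
    if buildbot_alive(pid):
        stop(pid)
        return True
    return False
```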
Updated • 7 years ago
Component: Tools → General