Closed Bug 1005133 Opened 10 years ago Closed 10 years ago

Reduce log retention on buildbot masters from 200 twistd.log files to 100 twistd.log files

Categories

(Release Engineering :: General, defect)

x86
macOS
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: pmoore, Assigned: pmoore)

References

Details

Attachments

(1 file)

Today buildbot-master73 ran low on disk space. It's an Amazon machine with a 15GB disk, and the twistd.log files take up 10GB (200 logs * 50 MB).

Reducing the retention on all buildbot masters to 100 log files should free up approximately 5GB of disk space on each master (100 logs * 50 MB).
Attachment #8416580 - Flags: review?(catlee)
Attachment #8416580 - Flags: feedback?(mgervasini)
Attachment #8416580 - Flags: review?(catlee) → review+
Attachment #8416580 - Flags: checked-in+
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
While puppet will deploy this to the masters, we still have to restart each master to pick it up (a reconfig doesn't notice the change, see bug 856594 comment #15). A rolling process of graceful shutdown and start is a good way to do that, taking out a single master of each type (e.g. try, tests1-windows, etc.) at a time.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Attachment #8416580 - Flags: feedback?(mgervasini)
I disabled bm73 in slavealloc, gracefully restarted bm73, and then checked the logs. The old ones didn't get trashed, so I manually trashed them after checking the file timestamps, i.e.:

for ((i=101; i<=200; i++)); do rm -f /builds/buildbot/build1/master/twistd.log.$i; done

I checked disk space, which went down from 13GB (88%) to 7.7GB (55%) usage after deleting the logs.
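
(For reference, the usage numbers above are just df output on the master; the path below is an assumption on my part, a plain df -h shows the same thing:)

df -h /builds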

Then I reenabled in slavealloc, and bm73 has started taking jobs again:
http://buildbot-master73.srv.releng.usw2.mozilla.com:8001/buildslaves?no_builders=1

Puppet did successfully update buildbot.tac files, but I'll need to perform the same steps as above for all other buildbot masters too...
This is going to be non-trivial to roll out, since we don't have a tried-and-tested mechanism for looping through all masters in sequence, shutting each one down, performing a task (deleting the old log files), and restarting it, although the individual parts should not be too difficult to piece together.

It requires a careful balance of risk versus time-to-implement.

Brainstorming the idea, I am considering scripting up all of the following tasks (a rough sketch of the per-batch loop follows below):

1) Disable 20% of each class of buildmaster in slavealloc
2) Several hours later, gracefully shut the disabled buildbot masters down (using the fabric buildfarm/maintenance tools)
3) Purge the old logs (as in comment 3)
4) Start the buildbot masters back up
5) Re-enable them in slavealloc

Assuming this goes well, I can then repeat the process four more times for the remaining masters.

Reasons for this approach:
  * if it goes bad, only 20% of the buildmasters are affected at any stage
  * disabling ahead of time means steps 2-5 happen in a relatively short window, which makes it relatively easy to monitor for issues
  * it can be fully automated if the first iteration is successful
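
A rough sketch of what one 20% batch could look like, to make the above concrete. The host names below are hypothetical, and the slavealloc disable/enable and graceful shutdown/start steps are just placeholders for whatever the fabric buildfarm/maintenance tooling gives us, not exact commands:

BATCH="buildbot-master01 buildbot-master02"  # hypothetical 20% slice of one master class

# step 1: disable the batch in slavealloc (placeholder)
for m in $BATCH; do
    echo "TODO: disable $m in slavealloc"
done

# ...several hours later, once the slaves have drained off these masters...
for m in $BATCH; do
    # step 2: graceful shutdown (placeholder for the fabric buildfarm/maintenance action)
    echo "TODO: gracefully shut down buildbot on $m"
    # step 3: purge the old logs, as in comment 3 (path from the build1 example; adjust per master)
    ssh $m 'for ((i=101; i<=200; i++)); do rm -f /builds/buildbot/build1/master/twistd.log.$i; done'
    # step 4: start the master back up (placeholder)
    echo "TODO: start buildbot on $m"
    # step 5: re-enable in slavealloc (placeholder)
    echo "TODO: re-enable $m in slavealloc"
done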

Please note I am on PTO on Monday 17 May / Tuesday 18 May, so I can only start this on Wednesday 19 May. Unless, Coop, you want to pick this up on Monday?
Flags: needinfo?(coop)
fabric has a graceful_restart action that is easily extended to also enable/disable in slavealloc. e.g.

https://github.com/catlee/tools/compare/master...fabric#diff-8fd5c265985d88a725d36d2e7fddc64aR261
(In reply to Chris AtLee [:catlee] from comment #5)
> fabric has a graceful_restart action that is easily extended to also
> enable/disable in slavealloc. e.g.
> 
> https://github.com/catlee/tools/compare/master...fabric#diff-
> 8fd5c265985d88a725d36d2e7fddc64aR261

Hrmmm, it pains me a bit that those actions aren't landed. We never should have taken you guys out of the buildduty rotation! ;)

I've been iterating through the masters all day. I'll post a full list of which masters are done and which remain before I head out.
Flags: needinfo?(coop)
(In reply to Chris Cooper [:coop] from comment #6)
> I've been iterating through the masters all day. I'll post a full list of
> which masters are done and which remain before I head out.

All build and try masters are done.

Test masters are *mostly* done. Due to capacity issues and the sheer number of slaves involved, I have to tread carefully with the linux test masters.

In progress: bm01, bm111

These two masters ^^ have been marked for clean shutdown and are currently draining their slaves/jobs. I have a loop running in a screen session on each master that will start the master process back up once it goes away, but the masters will still need to be re-enabled in slavealloc once that happens.
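
(The loop in each screen session is roughly of this shape; the process match, master directory and start command below are assumptions rather than the literal loop:)

# wait for the running buildbot/twistd process to exit, then start the master again
while pgrep -f 'twistd.*buildbot' >/dev/null; do
    sleep 60
done
cd /builds/buildbot/tests1-linux/master && buildbot start .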

If you're feeling helpful, check the master links below to see whether each master has restarted properly (the button will say "Clean Shutdown"), and then re-enable it in slavealloc once it has restarted.

http://buildbot-master01.srv.releng.use1.mozilla.com:8201/
http://buildbot-master111.srv.releng.scl3.mozilla.com:8201/

Still to do tomorrow: bm03, bm112
(In reply to Chris Cooper [:coop] from comment #7) 
> Still to do tomorrow: bm03, bm112

All masters are done. Back to Pete to close this out as he sees fit.
Thanks for sorting this out, coop.
Status: REOPENED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Component: General Automation → General