Closed Bug 1005133 Opened 10 years ago Closed 10 years ago

Reduce log retention on buildbot masters from 200 twistd.log files to 100 twistd.log files

Categories

(Release Engineering :: General, defect)

x86
macOS
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: pmoore, Assigned: pmoore)

References

Details

Attachments

(1 file)

Today buildbot-master73 ran low on disk space. It's an Amazon machine with a 15GB disk, and the twistd.log files take up 10GB (200 logs * 50 MB).

Reducing the retention on all buildbot masters to 100 log files should free up approximately 5GB of disk space on each master (100 logs * 50 MB).
Attachment #8416580 - Flags: review?(catlee)
Attachment #8416580 - Flags: feedback?(mgervasini)
Attachment #8416580 - Flags: review?(catlee) → review+
Attachment #8416580 - Flags: checked-in+
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
While puppet will deploy this to the masters, we still have to restart each master to pick it up (a reconfig doesn't notice the change, see bug 856594 comment #15). A rolling process of graceful shutdown and start is a good way to do that, taking out a single master of each type (e.g. try, tests1-windows, etc.) at a time.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Attachment #8416580 - Flags: feedback?(mgervasini)
I disabled bm73 in slavealloc, gracefully restarted bm73, and then checked the logs. The old ones didn't get trashed, so I manually trashed them after checking the file timestamps, i.e.:

for ((i=101; i<=200; i++)); do rm -f /builds/buildbot/build1/master/twistd.log.$i; done

I checked disk space, which went down from 13GB (88%) to 7.7GB (55%) usage after deleting the logs.
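
(For reference, the usage numbers above are just df output on the master; the path below is an assumption on my part, a plain df -h shows the same thing:)

df -h /builds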

Then I reenabled in slavealloc, and bm73 has started taking jobs again:
http://buildbot-master73.srv.releng.usw2.mozilla.com:8001/buildslaves?no_builders=1

Puppet did successfully update buildbot.tac files, but I'll need to perform the same steps as above for all other buildbot masters too...
This is going to be non-trivial to roll out, since we don't have a tried-and-tested mechanism for looping through all masters in sequence, shutting each one down, performing a task (deleting the old log files), and restarting it, although the individual parts should not be too difficult to piece together.

It requires a careful balance of risk versus time-to-implement.

Brainstorming the idea, I am considering scripting up all of the following tasks (a rough sketch of the per-batch loop follows below):

1) Disable 20% of each class of buildmaster in slavealloc
2) Several hours later, gracefully shut the disabled buildbot masters down (using the fabric buildfarm/maintenance tools)
3) Purge the old logs (as in comment 3)
4) Start the buildbot masters back up
5) Re-enable them in slavealloc

Assuming this goes well, I can then repeat the process four more times for the remaining masters.

Reasons for this approach:
  * if it goes bad, only 20% of the buildmasters are affected at any stage
  * disabling ahead of time means steps 2-5 happen in a relatively short window, which makes it relatively easy to monitor for issues
  * it can be fully automated if the first iteration is successful
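
A rough sketch of what one 20% batch could look like, to make the above concrete. The host names below are hypothetical, and the slavealloc disable/enable and graceful shutdown/start steps are just placeholders for whatever the fabric buildfarm/maintenance tooling gives us, not exact commands:

BATCH="buildbot-master01 buildbot-master02"  # hypothetical 20% slice of one master class

# step 1: disable the batch in slavealloc (placeholder)
for m in $BATCH; do
    echo "TODO: disable $m in slavealloc"
done

# ...several hours later, once the slaves have drained off these masters...
for m in $BATCH; do
    # step 2: graceful shutdown (placeholder for the fabric buildfarm/maintenance action)
    echo "TODO: gracefully shut down buildbot on $m"
    # step 3: purge the old logs, as in comment 3 (path from the build1 example; adjust per master)
    ssh $m 'for ((i=101; i<=200; i++)); do rm -f /builds/buildbot/build1/master/twistd.log.$i; done'
    # step 4: start the master back up (placeholder)
    echo "TODO: start buildbot on $m"
    # step 5: re-enable in slavealloc (placeholder)
    echo "TODO: re-enable $m in slavealloc"
done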

Please note I am on PTO on Monday 17 May / Tuesday 18 May, so I can only start this on Wednesday 19 May. Unless, Coop, you want to pick this up on Monday?
Flags: needinfo?(coop)
fabric has a graceful_restart action that is easily extended to also enable/disable in slavealloc. e.g.

https://github.com/catlee/tools/compare/master...fabric#diff-8fd5c265985d88a725d36d2e7fddc64aR261
(In reply to Chris AtLee [:catlee] from comment #5)
> fabric has a graceful_restart action that is easily extended to also
> enable/disable in slavealloc. e.g.
> 
> https://github.com/catlee/tools/compare/master...fabric#diff-
> 8fd5c265985d88a725d36d2e7fddc64aR261

Hrmmm, it pains me a bit that those actions aren't landed. We never should have taken you guys out of the buildduty rotation! ;)

I've been iterating through the masters all day. I'll post a full list of which masters are done and which remain before I head out.
Flags: needinfo?(coop)
(In reply to Chris Cooper [:coop] from comment #6)
> I've been iterating through the masters all day. I'll post a full list of
> which masters are done and which remain before I head out.

All build and try masters are done.

Test masters are *mostly* done. Due to capacity issues and the sheer number of slaves involved, I have to tread carefully with the linux test masters.

In progress: bm01, bm111

These two masters ^^ have been marked for clean shutdown and are currently draining their slaves/jobs. I have a loop running in a screen session on each master that will start the master process back up once it goes away, but the masters will still need to be re-enabled in slavealloc once that happens.
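
(The loop in each screen session is roughly of this shape; the process match, master directory and start command below are assumptions rather than the literal loop:)

# wait for the running buildbot/twistd process to exit, then start the master again
while pgrep -f 'twistd.*buildbot' >/dev/null; do
    sleep 60
done
cd /builds/buildbot/tests1-linux/master && buildbot start .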

If you're feeling helpful, check the master links below to see whether each master has restarted properly (the button will say "Clean Shutdown"), and then re-enable it in slavealloc once it has restarted.

http://buildbot-master01.srv.releng.use1.mozilla.com:8201/
http://buildbot-master111.srv.releng.scl3.mozilla.com:8201/

Still to do tomorrow: bm03, bm112
(In reply to Chris Cooper [:coop] from comment #7) 
> Still to do tomorrow: bm03, bm112

All masters are done. Back to Pete to close this out as he sees fit.
Thanks for sorting this out, coop.
Status: REOPENED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Component: General Automation → General