Closed Bug 1138672 Opened 9 years ago Closed 9 years ago

vlan request - move bld-lion-r5-[007-015] from build pool and servo-lion-r5-[001,002] from servo pool both to try pool

Categories

(Infrastructure & Operations :: RelOps: General, task)

x86
macOS
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jlund, Assigned: dividehex)

I want to get this filed, but we can't act on it until these machines are disabled.

We should try to time this so the 10 machines are disabled for as little time as possible. I'll start looking into what else needs to be done for bug 1137047.

These hosts are currently $HOST.build.releng.scl3.mozilla.com, but they need to become $HOST.try.releng.scl3.mozilla.com and be set up like bld-lion-r5-[016-036].
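
For illustration, the hostname change described above can be sketched in Python. This is only a sketch under assumptions: `expand_hosts` and `to_try_fqdn` are hypothetical helper names, not part of inventory or any releng tooling, and the bracket notation is assumed to mean an inclusive, zero-padded range or a comma-separated list.

```python
import re


def expand_hosts(pattern: str) -> list[str]:
    """Expand 'bld-lion-r5-[007-015]' or 'servo-lion-r5-[001,002]' into hostnames."""
    m = re.fullmatch(r"(.*)\[([0-9,\-]+)\](.*)", pattern)
    if m is None:
        # No bracket expression: treat the pattern as a single hostname.
        return [pattern]
    prefix, body, suffix = m.groups()
    names = []
    for part in body.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            width = len(lo)  # keep the zero padding used in the pattern
            for n in range(int(lo), int(hi) + 1):
                names.append(f"{prefix}{n:0{width}d}{suffix}")
        else:
            names.append(f"{prefix}{part}{suffix}")
    return names


def to_try_fqdn(fqdn: str) -> str:
    """Rewrite a build-pool FQDN to its try-pool equivalent."""
    return fqdn.replace(".build.releng.", ".try.releng.", 1)


if __name__ == "__main__":
    for host in expand_hosts("bld-lion-r5-[007-015]"):
        old = f"{host}.build.releng.scl3.mozilla.com"
        print(old, "->", to_try_fqdn(old))
```

The sketch only covers the name mapping; the inventory, DHCP, and nagios changes listed later in this bug are separate manual steps.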
Blocks: 1137047
This will require:

* a vlan change from netops/dcops (if this is really time critical, we should try to set a specific time with them to make sure Van will be onsite)
* hostname changes in inventory
* SREG and CNAME modifications in inventory
* dhcp_scope changes in inventory
* nagios changes
* removal/move in deploy studio
* reimage
* any additional releng modifications (buildbot configs, slavealloc configs, etc.) after the machines are back up.
Assignee: relops → jwatkins
> * any additional releng modifications (buildbot configs, slavealloc configs,
> etc.) after the machines are back up.

buildbot configs patch is here: https://bugzilla.mozilla.org/show_bug.cgi?id=1137047#c4

I can do the slavealloc bits: disable the slaves before we start and update the FQDNs when this bug is finished.

van: hi :) what's your availability in the near term to help with this? This is not time critical, but doing it all at the same time seems to make more sense than disabling the machines now.
Flags: needinfo?(vle)
I have two other minis in bug 1100386 that we might as well tackle in this batch.

servo-lion-r5-001 -> bld-lion-r5-095 (build)
servo-lion-r5-002 -> bld-lion-r5-096 (try)

Can I ask you to take care of these machines at the same time, please?
(In reply to Amy Rich [:arich] [:arr] from comment #1)

> * removal/move in deploy studio
> * reimage

Since we have issues with using deploystudio across vlans, we may need the deploystudio server to follow them to the try vlan temporarily
(In reply to Jake Watkins [:dividehex] from comment #4)

This shouldn't be an issue. The only problems we had imaging were on the srv network.
> van: hi :) what's your availability between now and the near short term to help with this? This is not time critical but doing this all at the same time seems to make sense rather than disabling now.

:jlund, i can work on these hosts tomorrow 3/4/15. i'll be traveling to our pops in the bay area today to install some new routers. i'll ping you or someone in #releng before i start.
(In reply to Van Le [:van] from comment #6)
> > van: hi :) what's your availability between now and the near short term to help with this? This is not time critical but doing this all at the same time seems to make sense rather than disabling now.
> 
> :jlund, i can work on these hosts tomorrow 3/4/15. i'll be traveling to our
> pops in the bay area today to install some new routers. i'll ping you or
> someone in #releng before i start.

great, thanks! I'll be around during normal PT hours. I'll need about 1.5 hours' notice so I can disable the slaves before we start and let them finish their current builds.
I'll also need some time to prep and test inventory and nagios.  Here is the spreadsheet for the changes:

https://docs.google.com/a/mozilla.com/spreadsheets/d/1FxyiAoZWzV3UICEy2S0aLgWlgXXEN7CQaaMxCNPD2yU/edit?usp=sharing
(In reply to Jake Watkins [:dividehex] from comment #8)
> I'll also need some time to prep and test inventory and nagios.  Here is the
> ss for the changes:
> 
> https://docs.google.com/a/mozilla.com/spreadsheets/d/1FxyiAoZWzV3UICEy2S0aLgWlgXXEN7CQaaMxCNPD2yU/edit?usp=sharing

okay. I am going to disable the slaves at 1200 PT and then Van is going to ping me at 1400 PT to start his work.

from irc: 10:21:00 <van> just so we're on the same page, all i need to do is change vlans, then reimage right?

van: I think so, but you will want to confirm with dividehex, as there may be some steps in between switching vlans and reimaging.

dividehex: how much time do you need before and after van starts? Do the machines need to be disabled (not running build jobs) for all of your work?
Flags: needinfo?(jwatkins)
(In reply to Jordan Lund (:jlund) from comment #9)
> (In reply to Jake Watkins [:dividehex] from comment #8)

> dividehex: how much time do you need before and after van starts? Do the
> machines need to be disabled (not running build jobs) for all of your work?

YES. Jobs need to be completed before nagios or inventory is updated.

I have patches prepped for nagios and inventory.  So I won't need much time once the build jobs are completed.  I'll need you to ping me when they are done.

So the process is something like this:
jobs complete ->
patch nagios to remove hosts ->
update inventory ->
delete old hosts from DS and enable default ds group ->
switch vlans/ports ->
reimage (no more than 5 at a time) ->
move default ds group back ->
patch nagios with new hostnames ->
enable new try slaves to take builds
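
The "no more than 5 at a time" limit on the reimage step can be pictured as simple batching. A minimal Python sketch follows; the `batches` helper is hypothetical and is not part of DeployStudio or any releng tooling, and the host names are placeholders.

```python
from typing import Iterator, Sequence


def batches(hosts: Sequence[str], size: int = 5) -> Iterator[list[str]]:
    """Yield consecutive groups of at most `size` hosts."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(hosts), size):
        yield list(hosts[start:start + size])


if __name__ == "__main__":
    # Placeholder names; the real host list is tracked in the spreadsheet.
    hosts = [f"host-{n:03d}" for n in range(1, 12)]
    for group in batches(hosts):
        # Each group would be reimaged and confirmed before starting the next.
        print(len(group), group)
```

With 11 hosts this yields groups of 5, 5, and 1.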
Flags: needinfo?(jwatkins)
(In reply to Jake Watkins [:dividehex] from comment #10)

sounds good. The slaves have started disabling; I will ping once they are done.

dividehex, van: re coop's request above, can we do this at the same time:
servo-lion-r5-001 -> bld-lion-r5-095 (build)
servo-lion-r5-002 -> bld-lion-r5-096 (try)
Summary: vlan request - move bld-lion-r5-[006-015] machines from prod build pool to try build pool → vlan request - move bld-lion-r5-[007-015] from build pool and servo-lion-r5-[001,002] from servo pool both to try pool
we are now planning to move 9 build pool machines and 2 servo machines to try pool:

bld-lion-r5-007
bld-lion-r5-008
bld-lion-r5-009
bld-lion-r5-010
bld-lion-r5-011
bld-lion-r5-012
bld-lion-r5-013
bld-lion-r5-014
bld-lion-r5-015
servo-lion-r5-001
servo-lion-r5-002

Their new homes are reflected in the Google spreadsheet: https://docs.google.com/a/mozilla.com/spreadsheets/d/1FxyiAoZWzV3UICEy2S0aLgWlgXXEN7CQaaMxCNPD2yU/edit?usp=sharing
Nagios and inventory have been updated.  It will take about 10-15 minutes for DHCP and DNS to propagate.
move + reimage completed. please let me know of any issues.

vans-MacBook-Pro:~ vle$ fping < tester
bld-lion-r5-007.try.releng.scl3.mozilla.com is alive
bld-lion-r5-008.try.releng.scl3.mozilla.com is alive
bld-lion-r5-009.try.releng.scl3.mozilla.com is alive
bld-lion-r5-010.try.releng.scl3.mozilla.com is alive
bld-lion-r5-011.try.releng.scl3.mozilla.com is alive
bld-lion-r5-012.try.releng.scl3.mozilla.com is alive
bld-lion-r5-013.try.releng.scl3.mozilla.com is alive
bld-lion-r5-014.try.releng.scl3.mozilla.com is alive
bld-lion-r5-015.try.releng.scl3.mozilla.com is alive
bld-lion-r5-095.try.releng.scl3.mozilla.com is alive
bld-lion-r5-096.try.releng.scl3.mozilla.com is alive
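
A check like the one above can also be scripted. The Python sketch below assumes fping's default per-host output of `<host> is alive` or `<host> is unreachable`; the `unreachable` helper is a hypothetical name, not an existing tool.

```python
def unreachable(fping_output: str, expected: list[str]) -> list[str]:
    """Return the expected hosts that fping did not report as alive."""
    alive = set()
    for line in fping_output.splitlines():
        host, sep, status = line.strip().partition(" is ")
        if sep and status == "alive":
            alive.add(host)
    return [h for h in expected if h not in alive]


if __name__ == "__main__":
    sample = (
        "bld-lion-r5-007.try.releng.scl3.mozilla.com is alive\n"
        "bld-lion-r5-008.try.releng.scl3.mozilla.com is alive\n"
    )
    expected = [
        "bld-lion-r5-007.try.releng.scl3.mozilla.com",
        "bld-lion-r5-008.try.releng.scl3.mozilla.com",
    ]
    print(unreachable(sample, expected))
```

An empty result means every expected host answered.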
Status: NEW → RESOLVED
Closed: 9 years ago
Flags: needinfo?(vle)
Resolution: --- → FIXED
colo-trip: --- → scl3
whoops, i thought this was a dcops bug. i'll reopen and let jake close when he confirms.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
(In reply to Van Le [:van] from comment #15)
> whoops, i thought this was a dcops bug. i'll reopen and let jake close when
> he confirms.

great! once we get confirmation from Jake, we can enable these slaves again.

coop: in case we require these in the morning before I start Pacific time, to enable we need to land and reconfig:
https://bugzilla.mozilla.org/attachment.cgi?id=8573011&action=edit
https://bugzilla.mozilla.org/attachment.cgi?id=8573015&action=edit

Otherwise I will do this first thing when I come online, assuming this bug is resolved by then.
Flags: needinfo?(coop)
(In reply to Jordan Lund (:jlund) from comment #16)

> great! once we wait to get confirmation from Jake, we can enable these
> slaves again

Sorry :jlund, didn't realize you were waiting on me.  There is no other validation I need to do once they are reimaged and puppetized.  And I see puppet certs were generated for them successfully. Nagios checks have also been enabled for them.

You are clear to enable them to take builds.
(In reply to Jake Watkins [:dividehex] from comment #17)

awesome. thanks!
Status: REOPENED → RESOLVED
Closed: 9 years ago → 9 years ago
Flags: needinfo?(coop)
Resolution: --- → FIXED