Closed
Bug 1072872
Opened 10 years ago
Closed 10 years ago
Handle shutdown & restart of instances for EC2 maintenance
Categories
(Infrastructure & Operations Graveyard :: CIDuty, task, P2)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: nthomas, Assigned: coop)
Details
AWS has advised they are doing host maintenance:

One or more of your Amazon EC2 instances are scheduled to be rebooted for required host maintenance. The maintenance will occur sometime during the window provided for each instance. Each instance will experience a clean reboot and will be unavailable while the updates are applied to the underlying host. This generally takes no more than a few minutes to complete. Each instance will return to normal operation after the reboot, and all instance configuration and data will be retained. If you have startup procedures that aren’t automated during your instance boot process, please remember that you will need to log in and run them.

We will need to do this maintenance update in the window provided. You will not be able to stop/start or re-launch instances in order to avoid this maintenance update.

----

This is a little different to the usual notifications, since it doesn't advise that you can do the shutdown/restart yourself to move to a different host, so we may need to handle it differently.

They also link to http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/monitoring-instances-status-check_sched.html which says you can do the reboot yourself in the system-reboot case, but doesn't say if that clears the event.

To see affected instances:
https://console.aws.amazon.com/ec2/v2/home?region=us-west-2#Events:
https://console.aws.amazon.com/ec2/v2/home?region=us-east-1#Events:

Lots of buildbot-masters, puppet, some misc, and lots of spots. Spread over several time windows, the first starts at 1900 Pacific on the 28th. Kinda need to use boto to dump all the info out, the console is rubbish for copy and paste.
Comment 1•10 years ago
I used this snippet to dump:

from boto.ec2 import connect_to_region

for r in ("us-east-1", "us-west-2"):
    conn = connect_to_region(r)
    print "Events in", r
    statuses = conn.get_all_instance_status()
    events = [s for s in statuses if getattr(s, "events")]
    ret = []
    for e in events:
        i = conn.get_only_instances(instance_ids=[e.id])[0]
        # ignore spot instances
        if i.spot_instance_request_id:
            continue
        ret.append((e.events[0].not_after, i.tags["Name"]))
    for d, n in sorted(ret):
        print d, n

Events in us-east-1
2014-09-27T12:00:00.000Z buildbot-master01
2014-09-27T12:00:00.000Z buildbot-master02
2014-09-27T12:00:00.000Z buildbot-master03
2014-09-27T12:00:00.000Z buildbot-master114
2014-09-27T12:00:00.000Z buildbot-master69
2014-09-27T12:00:00.000Z buildbot-master75
2014-09-27T12:00:00.000Z buildbot-master76
2014-09-27T12:00:00.000Z buildbot-master94
2014-09-27T12:00:00.000Z dev-bld-linux64-ec2-mgerva
2014-09-27T12:00:00.000Z releng-jenkins01
2014-09-29T12:00:00.000Z buildbot-master113
2014-09-29T12:00:00.000Z buildbot-master117
2014-09-30T12:00:00.000Z buildbot-master70
2014-09-30T12:00:00.000Z buildbot-master71
2014-09-30T12:00:00.000Z buildbot-master77
2014-09-30T12:00:00.000Z dev-linux64-ec2-kmoir
2014-09-30T12:00:00.000Z releng-puppet2

Events in us-west-2
2014-09-26T16:00:00.000Z buildbot-master04
2014-09-26T16:00:00.000Z buildbot-master05
2014-09-26T16:00:00.000Z buildbot-master06
2014-09-26T16:00:00.000Z buildbot-master118
2014-09-26T16:00:00.000Z buildbot-master91
2014-09-26T16:00:00.000Z releng-puppet1
2014-09-26T16:00:00.000Z releng-puppet2
2014-09-26T16:00:00.000Z spdy-test
2014-09-27T16:00:00.000Z buildbot-master115
2014-09-27T16:00:00.000Z buildbot-master73
2014-09-27T16:00:00.000Z buildbot-master74
2014-09-27T16:00:00.000Z buildbot-master78
2014-09-27T16:00:00.000Z buildbot-master79
2014-09-27T16:00:00.000Z vcssync-dev
2014-09-30T16:00:00.000Z buildbot-master72

Sounds like we should restart most of them today to avoid unexpected issues.
:/

By restart I mean power off and power on to make sure instances don't start on the same hardware.

* Buildbot masters should be disabled in slavealloc, gracefully stopped, and only then restarted.
* Easier with puppet masters, they can be restarted anytime without any special preparation.
* not sure what to do with releng-jenkins01 and spdy-test
* probably not a big deal for the "dev" instances
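
For illustration only, the "power off and power on" step could be sketched roughly like this. It assumes `conn` is a boto EC2 connection as in the snippet above (anything exposing `stop_instances`, `get_only_instances` and `start_instances` would do); the function name, the polling interval and the retry limit are made up for the example. It does not cover the slavealloc disable or the graceful buildbot shutdown, and note that the AWS email quoted in the description says a stop/start will not avoid this particular maintenance.

```python
import time


def power_cycle(conn, instance_id, poll_seconds=15, max_polls=40, sleep=time.sleep):
    """Stop an instance, wait until it is fully stopped, then start it again.

    `conn` is assumed to be a boto EC2 connection. poll_seconds and max_polls
    are arbitrary illustrative defaults, not values from this bug.
    """
    conn.stop_instances(instance_ids=[instance_id])
    for _ in range(max_polls):
        # Re-fetch the instance each time so we see its current state.
        instance = conn.get_only_instances(instance_ids=[instance_id])[0]
        if instance.state == "stopped":
            break
        sleep(poll_seconds)
    else:
        # The loop ran out of polls without ever seeing "stopped".
        raise RuntimeError("%s did not reach 'stopped' in time" % instance_id)
    conn.start_instances(instance_ids=[instance_id])
```

A plain reboot would not go through the stopped state, which is why the sketch waits for "stopped" before starting the instance again.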
Assignee
Comment 2•10 years ago
I'll do this over the course of today.
Assignee: nobody → coop
Status: NEW → ASSIGNED
Priority: -- → P2
Comment 3•10 years ago
I restarted some of the instances. This is the list of events still to be done.

Events in us-east-1
2014-09-27T12:00:00.000Z buildbot-master01
2014-09-27T12:00:00.000Z buildbot-master02
2014-09-27T12:00:00.000Z buildbot-master03
2014-09-27T12:00:00.000Z buildbot-master114
2014-09-27T12:00:00.000Z buildbot-master69
2014-09-27T12:00:00.000Z buildbot-master75
2014-09-27T12:00:00.000Z buildbot-master76
2014-09-27T12:00:00.000Z buildbot-master94
2014-09-29T12:00:00.000Z buildbot-master113
2014-09-29T12:00:00.000Z buildbot-master117
2014-09-30T12:00:00.000Z buildbot-master70
2014-09-30T12:00:00.000Z buildbot-master71
2014-09-30T12:00:00.000Z buildbot-master77
2014-09-27T12:00:00.000Z releng-jenkins01

Events in us-west-2
2014-09-26T16:00:00.000Z buildbot-master04
2014-09-26T16:00:00.000Z buildbot-master05
2014-09-26T16:00:00.000Z buildbot-master06
2014-09-26T16:00:00.000Z buildbot-master118
2014-09-26T16:00:00.000Z buildbot-master91
2014-09-27T16:00:00.000Z buildbot-master115
2014-09-27T16:00:00.000Z buildbot-master73
2014-09-27T16:00:00.000Z buildbot-master74
2014-09-27T16:00:00.000Z buildbot-master78
2014-09-27T16:00:00.000Z buildbot-master79
2014-09-30T16:00:00.000Z buildbot-master72
Comment 4•10 years ago
Slightly updated script:

from boto.ec2 import connect_to_region

for r in ("us-east-1", "us-west-2"):
    conn = connect_to_region(r)
    print "Events in", r
    statuses = conn.get_all_instance_status()
    events = [s for s in statuses if getattr(s, "events")]
    ret = []
    for e in events:
        # skip completed
        if "[Completed]" in e.events[0].description:
            continue
        i = conn.get_only_instances(instance_ids=[e.id])[0]
        # ignore spot instances
        if i.spot_instance_request_id:
            continue
        ret.append((e.events[0].not_after, i.tags["Name"]))
    for d, n in sorted(ret):
        print d, n
Comment 5•10 years ago
Booo, from the email: "You will not be able to stop/start or re-launch instances in order to avoid this maintenance update."

This is the updated list with time windows specified:

Events in us-east-1
2014-09-27T06:00:00.000Z 2014-09-27T12:00:00.000Z buildbot-master02
2014-09-27T06:00:00.000Z 2014-09-27T12:00:00.000Z buildbot-master03
2014-09-27T06:00:00.000Z 2014-09-27T12:00:00.000Z buildbot-master114
2014-09-27T06:00:00.000Z 2014-09-27T12:00:00.000Z buildbot-master75
2014-09-27T06:00:00.000Z 2014-09-27T12:00:00.000Z buildbot-master76
2014-09-27T06:00:00.000Z 2014-09-27T12:00:00.000Z buildbot-master94
2014-09-27T06:00:00.000Z 2014-09-27T12:00:00.000Z dev-linux64-ec2-jlund2
2014-09-27T06:00:00.000Z 2014-09-27T12:00:00.000Z releng-jenkins01
2014-09-27T06:00:00.000Z 2014-09-27T12:00:00.000Z releng-puppet1
2014-09-29T06:00:00.000Z 2014-09-29T12:00:00.000Z buildbot-master113
2014-09-29T06:00:00.000Z 2014-09-29T12:00:00.000Z buildbot-master117
2014-09-30T06:00:00.000Z 2014-09-30T12:00:00.000Z buildbot-master70
2014-09-30T06:00:00.000Z 2014-09-30T12:00:00.000Z buildbot-master71
2014-09-30T06:00:00.000Z 2014-09-30T12:00:00.000Z buildbot-master77
2014-09-30T06:00:00.000Z 2014-09-30T12:00:00.000Z releng-puppet2

Events in us-west-2
2014-09-26T12:00:00.000Z 2014-09-26T16:00:00.000Z buildbot-master04
2014-09-26T12:00:00.000Z 2014-09-26T16:00:00.000Z buildbot-master05
2014-09-26T12:00:00.000Z 2014-09-26T16:00:00.000Z buildbot-master06
2014-09-26T12:00:00.000Z 2014-09-26T16:00:00.000Z buildbot-master118
2014-09-26T12:00:00.000Z 2014-09-26T16:00:00.000Z buildbot-master91
2014-09-27T12:00:00.000Z 2014-09-27T16:00:00.000Z buildbot-master115
2014-09-27T12:00:00.000Z 2014-09-27T16:00:00.000Z buildbot-master73
2014-09-27T12:00:00.000Z 2014-09-27T16:00:00.000Z buildbot-master74
2014-09-27T12:00:00.000Z 2014-09-27T16:00:00.000Z buildbot-master78
2014-09-27T12:00:00.000Z 2014-09-27T16:00:00.000Z buildbot-master79
2014-09-30T12:00:00.000Z 2014-09-30T16:00:00.000Z buildbot-master72
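
The change that produced the two-column windows isn't shown in this bug, so the following is only a guess at its shape. It assumes each boto status object has `.id` and `.events`, and that each event carries `.description`, `.not_before` and `.not_after`. The function name and the `names` mapping (instance id to Name tag) are hypothetical. Unlike the earlier snippets it looks at every event on an instance rather than only `events[0]`, and it leaves the spot-instance filtering to the caller.

```python
def pending_event_rows(statuses, names, skip=("[Completed]",)):
    """Return sorted (not_before, not_after, name) rows for unfinished events.

    `statuses` is assumed to look like the result of boto's
    get_all_instance_status(). `names` is a hypothetical mapping of
    instance id -> Name tag; ids missing from it are shown as-is.
    """
    rows = []
    for status in statuses:
        for event in status.events or []:
            # Skip events AWS has already marked as done.
            if any(marker in event.description for marker in skip):
                continue
            name = names.get(status.id, status.id)
            rows.append((event.not_before, event.not_after, name))
    # ISO timestamps sort chronologically as plain strings.
    return sorted(rows)
```

Each row can then be printed as `print(" ".join(row))` to get lines in the same shape as the list above.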
Assignee
Comment 6•10 years ago
Those times are all over the map. We're going to randomly lose a bunch of capacity over the next 5 days. Should we be proactive and try to graceful the affected masters a few hours before each of the scheduled events? First batch is tomorrow afternoon.
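
If we did go the proactive route, working out when to start each graceful shutdown is simple arithmetic on the window start times. A rough sketch, where the function name is made up and the 3-hour lead is just a placeholder for "a few hours":

```python
from datetime import datetime, timedelta


def graceful_start_times(windows, lead_hours=3):
    """Map each master name to the UTC time its graceful shutdown should begin.

    `windows` is a list of (not_before, name) pairs, with not_before in the
    format the AWS events use, e.g. "2014-09-27T06:00:00.000Z".
    lead_hours is an assumed value, not something agreed in this bug.
    """
    lead = timedelta(hours=lead_hours)
    plan = {}
    for not_before, name in windows:
        window_start = datetime.strptime(not_before, "%Y-%m-%dT%H:%M:%S.%fZ")
        plan[name] = window_start - lead
    return plan
```

The returned datetimes are naive but represent UTC, since the event timestamps end in "Z".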
Comment 7•10 years ago
I think so. Or we can just disable them in slavealloc and let Amazon do their dirty^W work :)
Reporter
Comment 8•10 years ago
(In reply to Rail Aliiev [:rail] from comment #5)
> Booo, from the email: "You will not be able to stop/start or re-launch
> instances in order to avoid this maintenance update."

Bah! Sorry for missing that when filing this.
Comment 9•10 years ago
From #releng just now:

pmoore
15:45:16 rail: i'm kind of tempted to let it take care of itself - in that our throughput will be higher if we don't disable, but the noise may be higher in the trees
15:45:35 but wait times / end-to-end times will no doubt be slower if we start disabling hosts - i guess it is a tradeoff
15:45:41 what do you think?
15:45:52 maybe i'm just lazy ;)
rail
15:47:18 pmoore: I'm not a big fan of babysitting them, we can assume that that kind of thing can happen anytime :)
15:47:38 pmoore: in other words I agree with you
pmoore
15:47:42 :)
15:48:34 rail: laura will be happy too, it helps with our disaster recovery planning - nothing like having production disasters to discover where your points of failure are :)
Comment 10•10 years ago
Loose agreement from sheriffs - will notify them now!

======
However I'm totally fine with some breakage if it's to experiment with better workflows longer term :-)

Thank you for thinking outside of the box!

Best wishes,

Ed
======
Comment 11•10 years ago
bhearsum
16:03:20 thanks dustin
pmoore
16:04:03 edmorley: Tomcat|Sheriffduty: see https://bugzilla.mozilla.org/show_bug.cgi?id=1072872#c8 - buildbot masters disruption
16:04:45 edmorley: Tomcat|Sheriffduty: basically, we'd like to avoid babysitting masters, if possible - rather than disabling which will lower throughput
edmorley
16:05:01 sgtm
pmoore
16:05:13 edmorley: Tomcat|Sheriffduty: it might mean some extra noise on the trees - it is basically what we were planning to talk about on Monday, but has come a bit early :)
Assignee
Comment 12•10 years ago
Masters started getting rebooted by Amazon. So far they've all come back into service on their own with only a minimal amount of queue cleanup to perform for buildduty. Will continue to monitor. We may have more cleanup to do on Monday.
Reporter
Comment 13•10 years ago
All good. Still to clear:

Events in us-east-1
2014-09-29T12:00:00.000Z buildbot-master113
2014-09-29T12:00:00.000Z buildbot-master117
2014-09-30T12:00:00.000Z buildbot-master70
2014-09-30T12:00:00.000Z buildbot-master71
2014-09-30T12:00:00.000Z buildbot-master77
2014-09-30T12:00:00.000Z releng-puppet2

Events in us-west-2
2014-09-27T16:00:00.000Z buildbot-master115
2014-09-27T16:00:00.000Z buildbot-master73
2014-09-27T16:00:00.000Z buildbot-master74
2014-09-27T16:00:00.000Z buildbot-master78
2014-09-27T16:00:00.000Z buildbot-master79
Reporter
Comment 14•10 years ago
us-west-2 all done, just the 6 in us-east-1 to go.
Reporter
Comment 15•10 years ago
I've done the monthly restart of the masters (but not the hosts), except for the 5 masters in us-east-1 (comment #13) since AWS will do it.
Comment 16•10 years ago
(In reply to Nick Thomas [:nthomas] from comment #15)
> I've done the monthly restart of the masters (but not the hosts), except for
> the 5 masters in us-east-1 (comment #13) since AWS will do it.

Awesome, thanks Nick for doing this!
Comment 17•10 years ago
Only 4 important instances left (all in use1):

2014-09-30T06:00:00.000Z 2014-09-30T12:00:00.000Z buildbot-master70
2014-09-30T06:00:00.000Z 2014-09-30T12:00:00.000Z buildbot-master71
2014-09-30T06:00:00.000Z 2014-09-30T12:00:00.000Z buildbot-master77
2014-09-30T06:00:00.000Z 2014-09-30T12:00:00.000Z releng-puppet2
Comment 18•10 years ago
(In reply to Rail Aliiev [:rail] from comment #17)
> Only 4 important instances left (all in use1):
>
> 2014-09-30T06:00:00.000Z 2014-09-30T12:00:00.000Z buildbot-master70
> 2014-09-30T06:00:00.000Z 2014-09-30T12:00:00.000Z buildbot-master71
> 2014-09-30T06:00:00.000Z 2014-09-30T12:00:00.000Z buildbot-master77
> 2014-09-30T06:00:00.000Z 2014-09-30T12:00:00.000Z releng-puppet2

These have all been rebooted now. Looks ok.
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Updated•6 years ago
Product: Release Engineering → Infrastructure & Operations
Updated•4 years ago
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard