Closed Bug 1072872 Opened 10 years ago Closed 10 years ago

Handle shutdown & restart of instances for EC2 maintenance

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task, P2)

x86
All

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: nthomas, Assigned: coop)

Details

AWS has advised they are doing host maintenance:

One or more of your Amazon EC2 instances are scheduled to be rebooted for required host maintenance. The maintenance will occur sometime during the window provided for each instance. Each instance will experience a clean reboot and will be unavailable while the updates are applied to the underlying host. This generally takes no more than a few minutes to complete.

Each instance will return to normal operation after the reboot, and all instance configuration and data will be retained. If you have startup procedures that aren’t automated during your instance boot process, please remember that you will need to log in and run them.  We will need to do this maintenance update in the window provided.  You will not be able to stop/start or re-launch instances in order to avoid this maintenance update. 
----

This is a little different to the usual notifications, since it doesn't advise that you can do the shutdown/restart yourself to move to a different host, so we may need to handle it differently. They also link to http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/monitoring-instances-status-check_sched.html which says you can do the reboot yourself in the system-reboot case, but doesn't say if that clears the event.
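
One way to see which case applies is to tally the pending events by their event code. This is a minimal sketch, assuming objects shaped like the ones boto's get_all_instance_status() returns (an `events` list whose entries have a `code` attribute); `count_event_codes` is a hypothetical helper name, and it does not answer whether a manual reboot clears a system-reboot event.

```python
from collections import Counter


def count_event_codes(statuses):
    """Tally scheduled events by event code (e.g. "system-reboot").

    `statuses` is any iterable of objects shaped like the items returned
    by boto's conn.get_all_instance_status(): each may carry an `events`
    list whose entries have a `code` attribute.
    """
    counts = Counter()
    for status in statuses:
        # `events` can be missing or None when nothing is scheduled
        for event in getattr(status, "events", None) or []:
            counts[event.code] += 1
    return counts
```

With a live connection this would be called as count_event_codes(conn.get_all_instance_status()).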

To see affected instances:
https://console.aws.amazon.com/ec2/v2/home?region=us-west-2#Events:
https://console.aws.amazon.com/ec2/v2/home?region=us-east-1#Events:

Lots of buildbot-masters, puppet, some misc, and lots of spots. Spread over several time windows, the first starts at 1900 Pacific on the 28th. Kinda need to use boto to dump all the info out, the console is rubbish for copy and paste.
I used this snippet to dump:

from boto.ec2 import connect_to_region

for r in ("us-east-1", "us-west-2"):
    conn = connect_to_region(r)
    print("Events in %s" % r)
    statuses = conn.get_all_instance_status()
    # keep only the statuses that carry scheduled events
    events = [s for s in statuses if getattr(s, "events", None)]
    ret = []
    for e in events:
        i = conn.get_only_instances(instance_ids=[e.id])[0]
        # ignore spot instances
        if i.spot_instance_request_id:
            continue
        # end of the window plus the instance name (or its id if untagged)
        ret.append((e.events[0].not_after, i.tags.get("Name", i.id)))
    for d, n in sorted(ret):
        print("%s %s" % (d, n))


Events in us-east-1
2014-09-27T12:00:00.000Z buildbot-master01
2014-09-27T12:00:00.000Z buildbot-master02
2014-09-27T12:00:00.000Z buildbot-master03
2014-09-27T12:00:00.000Z buildbot-master114
2014-09-27T12:00:00.000Z buildbot-master69
2014-09-27T12:00:00.000Z buildbot-master75
2014-09-27T12:00:00.000Z buildbot-master76
2014-09-27T12:00:00.000Z buildbot-master94
2014-09-27T12:00:00.000Z dev-bld-linux64-ec2-mgerva
2014-09-27T12:00:00.000Z releng-jenkins01
2014-09-29T12:00:00.000Z buildbot-master113
2014-09-29T12:00:00.000Z buildbot-master117
2014-09-30T12:00:00.000Z buildbot-master70
2014-09-30T12:00:00.000Z buildbot-master71
2014-09-30T12:00:00.000Z buildbot-master77
2014-09-30T12:00:00.000Z dev-linux64-ec2-kmoir
2014-09-30T12:00:00.000Z releng-puppet2

Events in us-west-2
2014-09-26T16:00:00.000Z buildbot-master04
2014-09-26T16:00:00.000Z buildbot-master05
2014-09-26T16:00:00.000Z buildbot-master06
2014-09-26T16:00:00.000Z buildbot-master118
2014-09-26T16:00:00.000Z buildbot-master91
2014-09-26T16:00:00.000Z releng-puppet1
2014-09-26T16:00:00.000Z releng-puppet2
2014-09-26T16:00:00.000Z spdy-test
2014-09-27T16:00:00.000Z buildbot-master115
2014-09-27T16:00:00.000Z buildbot-master73
2014-09-27T16:00:00.000Z buildbot-master74
2014-09-27T16:00:00.000Z buildbot-master78
2014-09-27T16:00:00.000Z buildbot-master79
2014-09-27T16:00:00.000Z vcssync-dev
2014-09-30T16:00:00.000Z buildbot-master72

Sounds like we should restart most of them today to avoid unexpected issues. :/

By restart I mean power off and power on to make sure instances don't start on the same hardware.

* Buildbot masters should be disabled in slavealloc, gracefully stopped, and only then restarted.
* Easier with puppet masters, they can be restarted anytime without any special preparation.
* Not sure what to do with releng-jenkins01 and spdy-test.
* Probably not a big deal for the "dev" instances.
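
The power off / power on step could be sketched roughly as below. It assumes EBS-backed instances and a boto EC2 connection (or anything with the same stop_instances, start_instances and get_only_instances methods); `stop_then_start` and its `poll_seconds` and `sleep` parameters are hypothetical, and it does not cover the slavealloc or graceful-shutdown steps for buildbot masters. Note that the AWS notice quoted above says stop/start will not avoid this particular maintenance.

```python
import time


def stop_then_start(conn, instance_id, poll_seconds=15, sleep=time.sleep):
    """Stop an instance, wait until it is stopped, then start it again.

    A full stop/start (unlike a reboot) lets EC2 place the instance on a
    different host. `conn` is assumed to behave like a boto EC2 connection.
    """
    conn.stop_instances(instance_ids=[instance_id])
    instance = conn.get_only_instances(instance_ids=[instance_id])[0]
    # update() refreshes the instance and returns its current state
    while instance.update() != "stopped":
        sleep(poll_seconds)
    conn.start_instances(instance_ids=[instance_id])
    while instance.update() != "running":
        sleep(poll_seconds)
    return instance
```

There is no timeout here, so a stuck instance would make it wait forever; a real run would want an upper bound on the polling.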
I'll do this over the course of today.
Assignee: nobody → coop
Status: NEW → ASSIGNED
Priority: -- → P2
I restarted some of the instances. This is the list of events still to be done.

Events in us-east-1
2014-09-27T12:00:00.000Z buildbot-master01
2014-09-27T12:00:00.000Z buildbot-master02
2014-09-27T12:00:00.000Z buildbot-master03
2014-09-27T12:00:00.000Z buildbot-master114
2014-09-27T12:00:00.000Z buildbot-master69
2014-09-27T12:00:00.000Z buildbot-master75
2014-09-27T12:00:00.000Z buildbot-master76
2014-09-27T12:00:00.000Z buildbot-master94
2014-09-29T12:00:00.000Z buildbot-master113
2014-09-29T12:00:00.000Z buildbot-master117
2014-09-30T12:00:00.000Z buildbot-master70
2014-09-30T12:00:00.000Z buildbot-master71
2014-09-30T12:00:00.000Z buildbot-master77


2014-09-27T12:00:00.000Z releng-jenkins01


Events in us-west-2
2014-09-26T16:00:00.000Z buildbot-master04
2014-09-26T16:00:00.000Z buildbot-master05
2014-09-26T16:00:00.000Z buildbot-master06
2014-09-26T16:00:00.000Z buildbot-master118
2014-09-26T16:00:00.000Z buildbot-master91
2014-09-27T16:00:00.000Z buildbot-master115
2014-09-27T16:00:00.000Z buildbot-master73
2014-09-27T16:00:00.000Z buildbot-master74
2014-09-27T16:00:00.000Z buildbot-master78
2014-09-27T16:00:00.000Z buildbot-master79
2014-09-30T16:00:00.000Z buildbot-master72
Slightly updated script:

from boto.ec2 import connect_to_region

for r in ("us-east-1", "us-west-2"):
    conn = connect_to_region(r)
    print("Events in %s" % r)
    statuses = conn.get_all_instance_status()
    # keep only the statuses that carry scheduled events
    events = [s for s in statuses if getattr(s, "events", None)]
    ret = []
    for e in events:
        # skip completed
        if "[Completed]" in e.events[0].description:
            continue
        i = conn.get_only_instances(instance_ids=[e.id])[0]
        # ignore spot instances
        if i.spot_instance_request_id:
            continue
        # end of the window plus the instance name (or its id if untagged)
        ret.append((e.events[0].not_after, i.tags.get("Name", i.id)))
    for d, n in sorted(ret):
        print("%s %s" % (d, n))
Booo, from the email: "You will not be able to stop/start or re-launch instances in order to avoid this maintenance update." 

This is the updated list with time windows specified:


Events in us-east-1
2014-09-27T06:00:00.000Z 2014-09-27T12:00:00.000Z buildbot-master02
2014-09-27T06:00:00.000Z 2014-09-27T12:00:00.000Z buildbot-master03
2014-09-27T06:00:00.000Z 2014-09-27T12:00:00.000Z buildbot-master114
2014-09-27T06:00:00.000Z 2014-09-27T12:00:00.000Z buildbot-master75
2014-09-27T06:00:00.000Z 2014-09-27T12:00:00.000Z buildbot-master76
2014-09-27T06:00:00.000Z 2014-09-27T12:00:00.000Z buildbot-master94
2014-09-27T06:00:00.000Z 2014-09-27T12:00:00.000Z dev-linux64-ec2-jlund2
2014-09-27T06:00:00.000Z 2014-09-27T12:00:00.000Z releng-jenkins01
2014-09-27T06:00:00.000Z 2014-09-27T12:00:00.000Z releng-puppet1
2014-09-29T06:00:00.000Z 2014-09-29T12:00:00.000Z buildbot-master113
2014-09-29T06:00:00.000Z 2014-09-29T12:00:00.000Z buildbot-master117
2014-09-30T06:00:00.000Z 2014-09-30T12:00:00.000Z buildbot-master70
2014-09-30T06:00:00.000Z 2014-09-30T12:00:00.000Z buildbot-master71
2014-09-30T06:00:00.000Z 2014-09-30T12:00:00.000Z buildbot-master77
2014-09-30T06:00:00.000Z 2014-09-30T12:00:00.000Z releng-puppet2
Events in us-west-2
2014-09-26T12:00:00.000Z 2014-09-26T16:00:00.000Z buildbot-master04
2014-09-26T12:00:00.000Z 2014-09-26T16:00:00.000Z buildbot-master05
2014-09-26T12:00:00.000Z 2014-09-26T16:00:00.000Z buildbot-master06
2014-09-26T12:00:00.000Z 2014-09-26T16:00:00.000Z buildbot-master118
2014-09-26T12:00:00.000Z 2014-09-26T16:00:00.000Z buildbot-master91
2014-09-27T12:00:00.000Z 2014-09-27T16:00:00.000Z buildbot-master115
2014-09-27T12:00:00.000Z 2014-09-27T16:00:00.000Z buildbot-master73
2014-09-27T12:00:00.000Z 2014-09-27T16:00:00.000Z buildbot-master74
2014-09-27T12:00:00.000Z 2014-09-27T16:00:00.000Z buildbot-master78
2014-09-27T12:00:00.000Z 2014-09-27T16:00:00.000Z buildbot-master79
2014-09-30T12:00:00.000Z 2014-09-30T16:00:00.000Z buildbot-master72
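
The script that produced the listing with both window bounds is not shown in this bug. A minimal sketch of how such rows could be collected, assuming the `not_before`, `not_after` and `description` attributes of boto's event objects; `pending_event_rows` and the `lookup_name` callable are hypothetical names introduced for illustration:

```python
def pending_event_rows(statuses, lookup_name):
    """Build sorted (not_before, not_after, name) rows for open events.

    `statuses` is shaped like boto's conn.get_all_instance_status() result.
    `lookup_name` maps an instance id to a display name, or returns None
    to skip the instance (for example, spot instances).
    """
    rows = []
    for status in statuses:
        for event in getattr(status, "events", None) or []:
            # completed events are marked in the description
            if "[Completed]" in (event.description or ""):
                continue
            name = lookup_name(status.id)
            if name is None:
                continue
            rows.append((event.not_before, event.not_after, name))
    return sorted(rows)
```

Printing each row with " ".join(row) would give lines in the same shape as the listing above.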
Those times are all over the map. We're going to randomly lose a bunch of capacity over the next 5 days.

Should we be proactive and try to graceful the affected masters a few hours before each of the scheduled events? First batch is tomorrow afternoon.
I think so. Or we can just disable them in slavealloc and let Amazon do their dirty^W work :)
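
If we did go the proactive route, the "a few hours before" times could be worked out from the window start times. A rough sketch, assuming the timestamp format seen in the listings in this bug; `disable_schedule` is a hypothetical helper and the 3-hour default lead is an assumption, not a value anyone agreed on here:

```python
from datetime import datetime, timedelta

TIME_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"  # UTC timestamps as printed by the script


def disable_schedule(rows, lead_hours=3):
    """Return (disable_at, name) pairs, earliest first.

    `rows` holds (not_before, name) pairs with timestamps as strings;
    `lead_hours` is an assumed safety margin before each window opens.
    """
    lead = timedelta(hours=lead_hours)
    schedule = []
    for not_before, name in rows:
        start = datetime.strptime(not_before, TIME_FORMAT)
        schedule.append((start - lead, name))
    return sorted(schedule)
```

The returned datetimes are naive but represent UTC, the same as the input strings.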
(In reply to Rail Aliiev [:rail] from comment #5)
> Booo, from the email: "You will not be able to stop/start or re-launch
> instances in order to avoid this maintenance update." 

Bah! Sorry for missing that when filing this.
From #releng just now:

pmoore
15:45:16 rail: i'm kind of tempted to let it take care of itself - in that our throughput will be higher if we don't disable, but the noise may be higher in the trees
15:45:35 but wait times / end-to-end times will no doubt be slower if we start disabling hosts - i guess it is a tradeoff
15:45:41 what do you think?
15:45:52 maybe i'm just lazy ;)

rail
15:47:18 pmoore: I'm not a big fan of babysitting them, we can assume that that kind of thing can happen anytime :)
15:47:38 pmoore: in other words I agree with you

pmoore
15:47:42 :)
15:48:34 rail: laura will be happy too, it helps with our disaster recovery planning - nothing like having production disasters to discover where your points of failure are :)
Loose agreement from sheriffs - will notify them now!

======

However I'm totally fine with some breakage if it's to experiment with better workflows longer term :-)
Thank you for thinking outside of the box!

Best wishes,

Ed

======
bhearsum
16:03:20 thanks dustin
pmoore
16:04:03 edmorley: Tomcat|Sheriffduty: see https://bugzilla.mozilla.org/show_bug.cgi?id=1072872#c8 - buildbot masters disruption
16:04:45 edmorley: Tomcat|Sheriffduty: basically, we'd like to avoid babysitting masters, if possible - rather than disabling which will lower throughput

edmorley
16:05:01 sgtm

pmoore
16:05:13 edmorley: Tomcat|Sheriffduty: it might mean some extra noise on the trees - it is basically what we were planning to talk about on Monday, but has come a bit early :)
Masters started getting rebooted by Amazon. So far they've all come back into service on their own with only a minimal amount of queue cleanup to perform for buildduty.

Will continue to monitor. We may have more cleanup to do on Monday.
All good. Still to clear:

Events in us-east-1
2014-09-29T12:00:00.000Z buildbot-master113
2014-09-29T12:00:00.000Z buildbot-master117
2014-09-30T12:00:00.000Z buildbot-master70
2014-09-30T12:00:00.000Z buildbot-master71
2014-09-30T12:00:00.000Z buildbot-master77
2014-09-30T12:00:00.000Z releng-puppet2
Events in us-west-2
2014-09-27T16:00:00.000Z buildbot-master115
2014-09-27T16:00:00.000Z buildbot-master73
2014-09-27T16:00:00.000Z buildbot-master74
2014-09-27T16:00:00.000Z buildbot-master78
2014-09-27T16:00:00.000Z buildbot-master79
us-west-2 all done, just the 6 in us-east-1 to go.
I've done the monthly restart of the masters (but not the hosts), except for the 5 masters in us-east-1 (comment #13) since AWS will do it.
(In reply to Nick Thomas [:nthomas] from comment #15)
> I've done the monthly restart of the masters (but not the hosts), except for
> the 5 masters in us-east-1 (comment #13) since AWS will do it.

Awesome, thanks Nick for doing this!
Only 4 important instances left (all in use1):

2014-09-30T06:00:00.000Z 2014-09-30T12:00:00.000Z buildbot-master70
2014-09-30T06:00:00.000Z 2014-09-30T12:00:00.000Z buildbot-master71
2014-09-30T06:00:00.000Z 2014-09-30T12:00:00.000Z buildbot-master77
2014-09-30T06:00:00.000Z 2014-09-30T12:00:00.000Z releng-puppet2
(In reply to Rail Aliiev [:rail] from comment #17)
> Only 4 important instances left (all in use1):
> 
> 2014-09-30T06:00:00.000Z 2014-09-30T12:00:00.000Z buildbot-master70
> 2014-09-30T06:00:00.000Z 2014-09-30T12:00:00.000Z buildbot-master71
> 2014-09-30T06:00:00.000Z 2014-09-30T12:00:00.000Z buildbot-master77
> 2014-09-30T06:00:00.000Z 2014-09-30T12:00:00.000Z releng-puppet2

These have all been rebooted now. Looks ok.
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard