Closed Bug 1166448 Opened 9 years ago Closed 9 years ago

extend cloud-tools to handle windows 2008r2 golden image and spot deployments

Categories

(Release Engineering :: General, defect)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: arich, Assigned: grenade)

Attachments

(4 files, 3 obsolete files)

We need to extend cloud-tools (https://github.com/mozilla/build-cloud-tools) to handle deployment of 2008R2 AMIs as well as linux AMIs so that we can generate new golden images from cron as well as deploy to spot instances.

The golden image creation is handled through the aws_create_instance.py script in cron, and an example can be seen on aws-manager2.srv.releng.scl3.mozilla.com:/builds/aws_manager/bin/aws_manager-try-linux64-ec2-golden.sh

We'll need separate cron jobs for build and for try, just as with linux.

Rail, do you have any guidance to give Rob/Mark/Q on what changes might be needed there (e.g. do we use cloud-init in there and have to change it to ec2config?) as well as in the spot instance deployment? I'm not sure what scripts handle the latter.
Flags: needinfo?(rail)
I think we talked about this a couple of times on Vidyo. Let me summarize the main entry points that need to be touched:

* the script consumes configs like this: https://github.com/rail/build-cloud-tools/blob/master/configs/tst-win64. It may be outdated.
* https://github.com/rail/build-cloud-tools/blob/master/configs/tst-win64.user_data is used as user data template and populated with actual data in https://github.com/rail/build-cloud-tools/blob/master/cloudtools/scripts/aws_create_instance.py#L112-112
* assimilate_instance may need to be adjusted, especially https://github.com/rail/build-cloud-tools/blob/master/cloudtools/aws/instance.py#L92-92


When aws_watch_pending.py starts instances, it populates user data using the same templates.

I hope this is something that lets you start hacking. I'll be glad to answer any further questions.
Flags: needinfo?(rail)
we need to define the naming for 2008 build/try.

I am temporarily working with bld-2008-ec2-golden and try-2008-ec2-golden. Another suggestion was: b-2008-ec2-golden & (y|z)-2008-ec2-golden.

Please comment if you have a preference.
PR created https://github.com/mozilla/build-cloud-tools/pull/84
Rail, please can you review the PR and merge/comment if appropriate?
Flags: needinfo?(rail)
I replied in the PR.
Flags: needinfo?(rail)
I took a look at the PR (and made some comments). It looks like a lot of the cruft from the tst-w64 stuff was still in residence there (including using the wrong base image). We definitely don't want to carry any of that over to the new stuff we're deploying.

Since :grenade is out the rest of this week, I've cloned the cloud-tools repo into buildduty's homedir on aws-manager1 to try to hack on it myself (easier to test that way).

I've split all the configs out into b-2008 and try-2008, but I need to touch base with rail because I'm getting an error on the deployment. It gets as far as spinning up the instance and running puppet (I can see the cert generation email), but it looks like it bombs out in the cleanup phase:


No hosts found. Please specify (single) host string for connection: Process LoggingProcess-1:
Traceback (most recent call last):
  File "/tools/python27/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "aws_create_instance.py", line 225, in run
    return super(LoggingProcess, self).run()
  File "/tools/python27/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "aws_create_instance.py", line 196, in create_instance
    ami_cleanup(mount_point="/", distro=config["distro"])
  File "/builds/aws_manager/cloud-tools/cloudtools/aws/ami.py", line 25, in ami_cleanup
    run('rm -rf %s' % (e,))
  File "/builds/aws_manager/lib/python2.7/site-packages/fabric/network.py", line 575, in host_prompting_wrapper
    host_string = raw_input("No hosts found. Please specify (single)"
EOFError: EOF when reading a line
Depends on: 1150908, 1173620
Flags: needinfo?(rail)
It tries to use fabric to connect to the host via SSH. I'm not sure if that's supported on Windows.
Flags: needinfo?(rail)
Attached patch windows-cloud-tools.diff (obsolete) — Splinter Review
I still need to talk to rail to figure out what the ami_configs files are used for (and what ami-ids should be in there), but I wanted to get my work so far up here and recorded so other folks can build off of it if necessary.
Depends on: 1173786
(In reply to Rail Aliiev [:rail] from comment #6)

Even though I was working out of a different directory, there was some hardcoded path to use the config file in /builds/aws_manager, that's why I was seeing the issue. I started testing out of the live dir on aws-manager1, and I got it to instantiate, run puppet, and create a snapshot and AMI in use1 (I haven't bothered replicating to usw2). I haven't tested the resultant AMI yet, so I'm not sure if it's functional, but the process to generate the AMI appears to be working, at least.

Base AMI:
AMI ID: ami-1d47b276
AMI Name: windows-2008r2-x86_64-hvm-base-2015-05-18

b-2008 golden image host:
hostname: b-2008-ec2-golden.build.releng.use1.mozilla.com
IP: 10.134.49.66 (like the other golden images, it instantiates in the srv vlan to minimize the chance of IP conflicts)

Resultant AMI:
AMI ID: ami-8b7783e0
AMI NAME: spot-b-2008-2015-06-11-14-17


The successful run looks like:

2015-06-11 07:10:29,415 - INFO - Sanity checking DNS entries...
2015-06-11 07:10:29,416 - INFO - Checking name conflicts for b-2008-ec2-golden
2015-06-11 07:10:58,593 - INFO - waiting for workers
2015-06-11 07:10:58,743 - INFO - Using IP 10.134.49.66
2015-06-11 07:10:59,347 - INFO - subnet subnet-35a9835e
2015-06-11 07:10:59,348 - INFO - ignore_subnet_check, usning subnet-35a9835e
2015-06-11 07:11:00,181 - INFO - instance Instance:i-7c30a7d5 created, waiting to come up
2015-06-11 07:11:31,308 - INFO - assimilating Instance:i-7c30a7d5
2015-06-11 07:11:31,566 - INFO - waiting for instance to shut down
2015-06-11 07:17:20,207 - INFO - clearing userData
2015-06-11 07:17:20,352 - INFO - starting instance
2015-06-11 07:17:20,764 - INFO - waiting for instance to start
2015-06-11 07:17:45,038 - INFO - Generating AMI spot-b-2008-2015-06-11-14-17
2015-06-11 07:17:45,038 - INFO - Distro win2008
2015-06-11 07:18:21,505 - INFO - Creating a snapshot
2015-06-11 07:20:58,373 - INFO - Creating AMI
2015-06-11 07:20:58,554 - INFO - Waiting...
2015-06-11 07:20:59,709 - INFO - AMI created
2015-06-11 07:20:59,709 - INFO - ID: ami-8b7783e0, name: spot-b-2008-2015-06-11-14-17
2015-06-11 07:20:59,836 - INFO - AMI spot-b-2008-2015-06-11-14-17 (ami-8b7783e0) is ready
2015-06-11 07:20:59,837 - WARNING - Terminating Instance:i-7c30a7d5
Probably you were using the virtualenv with vanilla cloud-tools installed.
Attached file b-2008.cmd.user_data (obsolete) —
Mark had been using a very different user_data file that kept the machine going until puppetization finished. Included here for reference.
Blocks: 1121023
Modify puppet to support the try-2008-ec2-* naming scheme and consolidate node defs for windows ec2 and datacenter nodes.
Attachment #8621215 - Flags: review?(mcornmesser)
Attachment #8621215 - Flags: review?(mcornmesser) → review+
After discussion with rail, I discovered what the files in ami_config were for. Those are the config files we use to take a third party AMI and turn it into our base ami. Since we're not using a third party ami (we're importing one from MDT), we don't actually need those files.

Also found out that new spot deploys know which AMI to use based on the moz_type and moz_created AMI tags. It uses whichever AMI of the correct type is newest.
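
For reference, a minimal sketch of that selection rule in Python. It works on plain dicts rather than real EC2 objects, and it assumes moz_created holds a value that sorts chronologically as an integer; the tag values below are made up for illustration (only the AMI IDs come from this bug), and `pick_newest_ami` is a hypothetical helper, not a cloud-tools function.

```python
# Hypothetical in-memory stand-ins for AMI records; the moz_created values
# are illustrative only and are not the real tag contents.
IMAGES = [
    {"id": "ami-8b7783e0", "tags": {"moz_type": "b-2008", "moz_created": "201506111417"}},
    {"id": "ami-73926518", "tags": {"moz_type": "b-2008", "moz_created": "201506120124"}},
    {"id": "ami-e9946382", "tags": {"moz_type": "try-2008", "moz_created": "201506120118"}},
]


def pick_newest_ami(images, moz_type):
    """Return the newest image tagged with the given moz_type, or None."""
    candidates = [
        img for img in images
        if img["tags"].get("moz_type") == moz_type and "moz_created" in img["tags"]
    ]
    if not candidates:
        return None
    # Assumption: moz_created can be compared as an integer timestamp.
    return max(candidates, key=lambda img: int(img["tags"]["moz_created"]))


print(pick_newest_ami(IMAGES, "b-2008")["id"])
```

One consequence of this rule is that a freshly generated but broken golden AMI would be picked up automatically, so a bad image has to be de-registered (or have its tags removed) to stop it from being used.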
We're also going to need to update https://github.com/mozilla/build-cloud-tools/blob/master/cloudtools/slavealloc.py

The information needs to match the classification of the data in http://slavealloc.pvt.build.mozilla.org/api/slaves (easier to read in http://slavealloc.pvt.build.mozilla.org/api/slaves)

This means that:
 distro = win2k8
 dc = the official amazon region names
 speed = we'll want to create a new designation for this in slavealloc (maybe r3.xlarge)
 bits = 64

The return value should be the host type that we're looking for in aws_watch_pending, e.g. b-2008 or try-2008.

We'll also need to modify https://github.com/mozilla/build-cloud-tools/blob/master/configs/watch_pending.cfg to map job types to moz_types. http://builddata.pub.build.mozilla.org/builddata/reports/allthethings.json lists all of the job types we'll need regexes for. We'll likely need everything that shows up under builds (not tests) in a grep of platform and win.
aws_stop_idle will not be sufficient for terminating idle instances, so instead of making modifications to that, we're going to deprecate it and replace it with runner/idleizer per bug 1173945.
A wrinkle for creating golden AMIs: apparently the base AMI has a hardcoded hostname that it uses. We need to pass in the correct hostname, reboot, and then puppetize the machine. This configuration will be part of the user_data we pass for the golden-ami generation.

As a temporary workaround, we could create two different base images for build and try, but we need to solve this problem for wide-scale deployment regardless.
Commands I used to fool puppet and runtime in the last round of testing (setx commands may not be necessary):

set FACTER_domain=build.releng.use1.mozilla.com  
set FACTER_hostname=b-2008-ec2-golden  
set FACTER_fqdn=b-2008-ec2-golden.build.releng.use1.mozilla.com  
set COMPUTERNAME=b-2008-ec2-golden

setx FACTER_domain build.releng.use1.mozilla.com  
setx FACTER_hostname b-2008-0011  
setx FACTER_fqdn b-2008-0011.build.releng.use1.mozilla.com  
setx COMPUTERNAME b-2008-ec2-golden
Q notes that those lines sometimes worked better if they had a space before the newline... Ah, Windows.

I've added those to the user_data (with the appropriate domains and hostnames) for the golden instance creation configs and am testing with b-2008-ec2-golden on i-2cb62685.
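
A rough sketch of how those lines could be generated per host instead of being hand-edited in each user_data file. This is only an illustration: `render_facter_lines` is a hypothetical helper, not something in cloud-tools, and it uses a single hostname for every variable (the commands quoted above mix b-2008-0011 and b-2008-ec2-golden, which was left over from testing).

```python
def render_facter_lines(hostname, domain):
    """Build the set/setx lines that pin the Facter facts for one host."""
    fqdn = "{}.{}".format(hostname, domain)
    values = [
        ("FACTER_domain", domain),
        ("FACTER_hostname", hostname),
        ("FACTER_fqdn", fqdn),
        ("COMPUTERNAME", hostname),
    ]
    # "set" affects the current shell; "setx" persists the value.
    lines = ["set {}={}".format(name, value) for name, value in values]
    lines += ["setx {} {}".format(name, value) for name, value in values]
    return "\n".join(lines) + "\n"


print(render_facter_lines("b-2008-ec2-golden", "build.releng.use1.mozilla.com"))
```

Whether the real templates should emit Windows line endings or any trailing whitespace is not settled by this sketch; it only shows the substitution.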


Mark also created a new ami, ami-8149bdea that has the hostname try-2008-ec2-golden.try.releng.use1.mozilla.com hardcoded in case Q's FACTER changes don't prove to be the magic that we need to make puppet work.
Attached file b-2008.cmd.user_data (obsolete) —
It turns out that ami-8149bdea didn't have the ec2config execute bit set, so I've de-registered that one.

Good news is that with some tweaks to the user_data files (putting the lines at the end of the set/setx commands is a *bad* thing and messes up the hostname, and we removed the exit line), we now have a process that builds the puppeted AMIs from the original base AMI.

AMI spot-b-2008-2015-06-12-01-24 (ami-73926518)
AMI spot-try-2008-2015-06-12-01-18 (ami-e9946382)

These need to be verified to see if they have the correct trust level and they produce useful instances (in particular if the execute bit is set for ec2config). If not, we have some possible leads for that. Q suggested that we might need to shut down the ec2config process, THEN copy in the config file that sets the execute bit, then do the shutdown and take the snapshot.
Attachment #8621054 - Attachment is obsolete: true
(In reply to Amy Rich [:arr] [:arich] from comment #15)

I think the watch_pending.cfg mapping for 2008 should look like:

        "^(TB )?WINNT (5.2|6.1|6.2)( x86-64)?(.*(?! try)).* (pgo-)?build": "b-2008",
        "^(TB )?WINNT (5.2|6.1|6.2)( x86-64)?.* nightly": "b-2008",
        "^(TB )?WINNT (5.2|6.1|6.2)( x86-64)? try": "try-2008",
        "^Firefox \\S+ win(32|64) l10n nightly": "b-2008",
        "^Thunderbird \\S+ win(32|64) l10n nightly": "b-2008",
        "^graphene_try\\S+_win(32|64)": "try-2008",

I'll need a sanity check on the python regexes by someone more versed in python than I am.
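
One way to do that sanity check is a small Python script that runs each pattern against a few builder names. The names below are illustrative (not pulled from allthethings.json), `re.search` is an assumption about how the config is applied, and `STRICTER_BUILD` is only a suggested alternative, not something from the PR.

```python
import re

# The proposed patterns as Python raw strings (the JSON file doubles the
# backslashes, so "\\S+" there is r"\S+" here).
MAPPING = [
    (r"^(TB )?WINNT (5.2|6.1|6.2)( x86-64)?(.*(?! try)).* (pgo-)?build", "b-2008"),
    (r"^(TB )?WINNT (5.2|6.1|6.2)( x86-64)?.* nightly", "b-2008"),
    (r"^(TB )?WINNT (5.2|6.1|6.2)( x86-64)? try", "try-2008"),
    (r"^Firefox \S+ win(32|64) l10n nightly", "b-2008"),
    (r"^Thunderbird \S+ win(32|64) l10n nightly", "b-2008"),
    (r"^graphene_try\S+_win(32|64)", "try-2008"),
]

# Suggested alternative for the first pattern: the lookahead scans the rest
# of the name, so any later " try" rejects the match.
STRICTER_BUILD = r"^(TB )?WINNT (5.2|6.1|6.2)( x86-64)?(?!.* try).* (pgo-)?build"


def matching_types(builder_name):
    """Return the moz_type of every pattern that matches, in config order."""
    return [
        moz_type for pattern, moz_type in MAPPING
        if re.search(pattern, builder_name)
    ]


# Illustrative builder names only.
for name in (
    "WINNT 5.2 mozilla-central build",
    "WINNT 5.2 try build",
    "WINNT 6.1 x86-64 try build",
):
    print(name, "->", matching_types(name))
```

The point to watch is the first pattern: `(.*(?! try))` does not keep try builders out, because `.*` can backtrack to a position where the lookahead succeeds. With the names above, both try builders match the b-2008 build pattern as well as the try-2008 one, so the result depends on which match wins. Moving the lookahead in front of the free text, as in `STRICTER_BUILD`, rejects them.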

For rules, we're starting out with r3.xlarge and can always add more later. I'm not sure why we ignore us-east-1b and us-east-1e, but I've replicated that for windows. The bid price is the current ondemand price (per discussion with catlee).

        "rules": {
            "b-2008": [
                {"instance_type": "r3.xlarge",
                 "ignored_azs": ["us-east-1b", "us-east-1e"],
                 "performance_constant": 1,
                 "bid_price": 0.60}
            ],
            "try-2008": [
                {"instance_type": "r3.xlarge",
                 "ignored_azs": ["us-east-1b", "us-east-1e"],
                 "performance_constant": 1,
                 "bid_price": 0.60}
            ],

I've set the limits for all locations to 0 for b-2008 and try-2008 for the time being so we don't start accidentally spinning stuff up in aws after this lands.
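
To illustrate how the rules and the zero limits interact, here is a small sketch. The `rules` entries mirror the config above, but the shape of `limits` (moz_type, then region, then count) and the function `candidate_requests` are assumptions made for the example; the real watch_pending.cfg layout and aws_watch_pending logic may differ.

```python
RULES = {
    "b-2008": [
        {"instance_type": "r3.xlarge",
         "ignored_azs": ["us-east-1b", "us-east-1e"],
         "performance_constant": 1,
         "bid_price": 0.60},
    ],
}

# Hypothetical layout: limits[moz_type][region] -> max instances.
ZERO_LIMITS = {"b-2008": {"us-east-1": 0}}


def candidate_requests(rules, limits, moz_type, region, azs):
    """List (instance_type, az, bid_price) options allowed by the config."""
    if limits.get(moz_type, {}).get(region, 0) <= 0:
        # A limit of 0 means nothing may be started for this type/region.
        return []
    options = []
    for rule in rules.get(moz_type, []):
        for az in azs:
            if az in rule["ignored_azs"]:
                continue
            options.append((rule["instance_type"], az, rule["bid_price"]))
    return options


azs = ["us-east-1a", "us-east-1b", "us-east-1c"]
print(candidate_requests(RULES, ZERO_LIMITS, "b-2008", "us-east-1", azs))
print(candidate_requests(RULES, {"b-2008": {"us-east-1": 5}}, "b-2008", "us-east-1", azs))
```

With the limits at 0 the first call yields no options at all, which is the intended safety net until the AMIs are verified.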
Attachment #8620977 - Attachment is obsolete: true
Attachment #8621398 - Attachment is obsolete: true
This PR should be ready for review/merge now.
Previous commits have been squashed into one.
https://patch-diff.githubusercontent.com/raw/mozilla/build-cloud-tools/pull/84
Attachment #8624264 - Flags: review?(rail)
url in previous comment should have read: https://github.com/mozilla/build-cloud-tools/pull/84
Comment on attachment 8624264 [details] [diff] [review]
windows-golden-ami

I merged the PR
Attachment #8624264 - Flags: review?(rail)
sg-18a07677 and sg-84beade6 can be removed from the b-2008 and try-2008 config files due to bug 1173786.
Closing this bug as the Windows golden AMI creation piece and associated changes to build-cloud-tools are complete and committed. There is still plenty to do on the Puppet front, but there are other bugs covering that. Feel free to re-open if there are pieces in scope that I haven't considered.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
There is still work to be done for watch_pending.cfg and slavealloc.py
Blocks: 1183535
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
diff extracted from arr's changes at: aws-manager1.srv.releng.scl3.mozilla.com:/home/buildduty/cloud-tools-arr
The PR referenced above had a couple build/test failures in Travis. The first was down to some missing commas but after fixing that, there's a test failure that I am having difficulty understanding.

        if not self._mock_unsafe:
            if name.startswith(('assert', 'assret')):
>               raise AttributeError(name)
E               AttributeError: assert_has_called

The build failure is here:
 https://travis-ci.org/mozilla/build-cloud-tools/builds/71922371#L357

Do you understand what it means?
Flags: needinfo?(rail)
Looks like mock has changed, and because we don't pin it we get this error. We should either pin mock in tox.ini to a particular version or figure out why the test fails with the new version of mock.
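
A short reproduction of the behaviour, using the standard library's `unittest.mock` on Python 3 (the Travis run used the separate `mock` package, which behaves the same way in its newer releases). `assert_has_called` is not a real Mock method; which real assertion the failing test meant to call is not established here, so `assert_has_calls` and `assert_called_once_with` below are only examples of valid ones.

```python
from unittest import mock

m = mock.Mock()
m(1)

# Real assertion helpers work as usual.
m.assert_called_once_with(1)
m.assert_has_calls([mock.call(1)])


def misspelled_assert_is_rejected(mock_obj):
    """Return True if a non-existent assert_* attribute raises AttributeError."""
    try:
        mock_obj.assert_has_called  # not a real Mock method
    except AttributeError:
        return True
    return False


print(misspelled_assert_is_rejected(m))
```

Older mock releases quietly created a child mock for such a name, so a misspelled assertion "passed" without checking anything. That means pinning mock would hide the problem rather than fix it; correcting the test to use a real assertion method is probably the better option.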
Flags: needinfo?(rail)
Fixed in PR: https://github.com/mozilla/build-cloud-tools/pull/89
Status: REOPENED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Component: Tools → General