Bug 1149580 (Closed) - Opened 9 years ago, Closed 9 years ago

disable AMI generation

Categories: Release Engineering :: General (defect)
Severity: normal
Status: RESOLVED FIXED
People: Reporter: dustin; Assigned: dustin
Attachments: 2 files

This was done by hand, but puppet reverted it

---

since we don't yet have a known cause or fix for the AMI issues we hit today, I've disabled AMI generation on aws-manager2 by commenting out the following cron jobs in /etc/cron.d/ (a sketch of the equivalent change follows the list):

puppetcheck
aws_manager-aws_publish_amis.py.cron
aws_manager-bld-linux64-ec2-golden.cron
aws_manager-try-linux64-ec2-golden.cron
aws_manager-tst-emulator64-ec2-golden.cron
aws_manager-tst-linux32-ec2-golden.cron
aws_manager-tst-linux64-ec2-golden.cron
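
For reference, a minimal shell sketch of what commenting these out amounts to (hypothetical -- the actual edit was done by hand in an editor; puppetcheck is included so puppet itself doesn't put the files back):

    # run as root on aws-manager2; prefix every active line with '#'
    cd /etc/cron.d
    for f in puppetcheck \
             aws_manager-aws_publish_amis.py.cron \
             aws_manager-bld-linux64-ec2-golden.cron \
             aws_manager-try-linux64-ec2-golden.cron \
             aws_manager-tst-emulator64-ec2-golden.cron \
             aws_manager-tst-linux32-ec2-golden.cron \
             aws_manager-tst-linux64-ec2-golden.cron; do
        sed -i 's/^\([^#]\)/# \1/' "$f"   # comment out each non-comment line
    done
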
Assignee: nobody → dustin
Attached patch bug1149580.patch (Splinter Review)
Attachment #8586156 - Flags: review?(catlee)
Notice: /Stage[main]/Aws_manager::Cron/Aws_manager::Crontask[tst-linux64-ec2-golden]/File[/etc/cron.d/aws_manager-tst-linux64-ec2-golden.cron]/ensure: removed
Notice: /Stage[main]/Aws_manager::Cron/Aws_manager::Crontask[aws_publish_amis.py]/File[/etc/cron.d/aws_manager-aws_publish_amis.py.cron]/ensure: removed
Notice: /Stage[main]/Aws_manager::Cron/Aws_manager::Crontask[tst-emulator64-ec2-golden]/File[/etc/cron.d/aws_manager-tst-emulator64-ec2-golden.cron]/ensure: removed
Notice: /Stage[main]/Aws_manager::Cron/Aws_manager::Crontask[try-linux64-ec2-golden]/File[/etc/cron.d/aws_manager-try-linux64-ec2-golden.cron]/ensure: removed
Notice: /Stage[main]/Aws_manager::Cron/Aws_manager::Crontask[bld-linux64-ec2-golden]/File[/etc/cron.d/aws_manager-bld-linux64-ec2-golden.cron]/ensure: removed
Notice: /Stage[main]/Aws_manager::Install/Python::Virtualenv[/builds/aws_manager]/Python::Virtualenv::Package[/builds/aws_manager||cfn-pyplates==0.4.3]/Exec[pip /builds/aws_manager||cfn-pyplates==0.4.3]/returns: executed successfully
Notice: /Stage[main]/Aws_manager::Install/Exec[install-cloud-tools-dist]/returns: executed successfully
Notice: /Stage[main]/Aws_manager::Cron/Aws_manager::Crontask[bld-linux64-ec2-golden]/File[/builds/aws_manager/bin/aws_manager-bld-linux64-ec2-golden.sh]/ensure: removed
Notice: /Stage[main]/Aws_manager::Cron/Aws_manager::Crontask[tst-emulator64-ec2-golden]/File[/builds/aws_manager/bin/aws_manager-tst-emulator64-ec2-golden.sh]/ensure: removed
Notice: /Stage[main]/Aws_manager::Cron/Aws_manager::Crontask[aws_publish_amis.py]/File[/builds/aws_manager/bin/aws_manager-aws_publish_amis.py.sh]/ensure: removed
Notice: /Stage[main]/Aws_manager::Cron/Aws_manager::Crontask[tst-linux32-ec2-golden]/File[/builds/aws_manager/bin/aws_manager-tst-linux32-ec2-golden.sh]/ensure: removed
Notice: /Stage[main]/Aws_manager::Cron/Aws_manager::Crontask[tst-linux64-ec2-golden]/File[/builds/aws_manager/bin/aws_manager-tst-linux64-ec2-golden.sh]/ensure: removed
Notice: /Stage[main]/Aws_manager::Cron/Aws_manager::Crontask[try-linux64-ec2-golden]/File[/builds/aws_manager/bin/aws_manager-try-linux64-ec2-golden.sh]/ensure: removed
Notice: /Stage[main]/Aws_manager::Cron/Aws_manager::Crontask[tst-linux32-ec2-golden]/File[/etc/cron.d/aws_manager-tst-linux32-ec2-golden.cron]/ensure: removed
Attachment #8586156 - Flags: review?(catlee) → review+
I'll leave this open for the revert.
Assignee: dustin → nobody
Blocks: 1147853
after catlee disabled new AMI publishing, I created a few loans (via puppetize) and un-applied one puppet patch at a time. Unfortunately, I was *not* able to get a green build:

my master/slaves: http://dev-master2.bb.releng.use1.mozilla.com:8038/builders/Ubuntu%20VM%2012.04%20mozilla-central%20opt%20test%20mochitest-3
my puppet env and revisions I tried: https://github.com/lundjordan/build-puppet/commits/ami-health-check

irc snippet
12:33:20 <jlund|buildduty> catlee: rail so wrt 'bad ami', I have not been able to green up the latest puppetized aws instance
12:33:30 <jlund|buildduty> catlee: rail here is my slaves/master: http://dev-master2.bb.releng.use1.mozilla.com:8038/builders/Ubuntu%20VM%2012.04%20mozilla-central%20opt%20test%20mochitest-3
12:33:44 <rail> that's really weird
12:33:52 <jlund|buildduty> catlee: rail here is the puppet patches I have pinned to their environments: https://github.com/lundjordan/build-puppet/commits/ami-health-check
12:34:46 <jlund|buildduty> (anything with REVERT are the patches I un-applied + the https://github.com/lundjordan/build-puppet/commit/f90c7f0923c981e8d7f0001b432b0661636bbb66 )
12:34:51 <jlund|buildduty> how should we proceed?
12:36:40 <Callek> my suggestion was "revert all those patches" and then enable ami generation, verify that things are good next day or two
12:37:11 <Callek> then reland those patches one at a time (not on friday) and regen ami's directly after landing, and let a day pass to verify no issue
12:37:40 <Callek> and once we identify the culprit we can revert it and proceed to drawing board with that patch
12:39:20 <catlee> jlund|buildduty: which AMI is your loaner baed on?
12:39:35 <catlee> jlund|buildduty: can you reproduce a green by not puppetizing it?
12:45:21 <jlund|buildduty> I like Callek's approach. we will have to change how I created the slaves. I created them via the loan process (aws_create_instance.py).
12:45:53 <jlund|buildduty> catlee: ^ this afaik, puppetizes the slave against current puppet tip. so it's not an AMI
12:47:28 <jlund|buildduty> I think to really be thorough here is we should create a loan from a puppet rev that was known to be good. and then apply the patches (as opposed to what I did of popping puppet patches)
12:47:43 <jlund|buildduty> similar to Callek's proposal

so before re-enabling ami gen on aws-manager2, we need to determine what we want to try next. I see it as:

1) take a production linux32 slave (that has an old known good AMI), and forcefully apply newer puppet patches to it
2) create a loan based on an earlier puppet patch. I'm assuming this would be like telling puppetize to not use tip or use a user puppet environment. Then test if slaves pass. If they do, apply one new patch at a time.
3) revert the known three puppet patches that are within the window of known-good-ami to known-bad-ami and let aws-manager2 publish new AMI's
4) do nothing and see if latest AMI works

catlee, rail: thoughts?

I am moving on to other important tasks until we reach consensus on a plan of action.
Flags: needinfo?(catlee)
> 1) take a production linux32 slave (that has an old known good AMI), and
> forcefully apply newer puppet patches to it

I have tst-linux32-spot-001 attached to my master and did a successful job with it (expected, as it's a known-good AMI). However, I can't seem to get puppet working again on it. I'd imagine I need to add secrets/deploy-password and some puppet setup.

here is the output:
[root@tst-linux32-spot-001.test.releng.use1.mozilla.com ~]# puppet agent --test --environment=jlund --server=releng-puppet2.srv.releng.scl3.mozilla.com
Info: Creating a new SSL key for tst-linux32-spot-001.test.releng.use1.mozilla.com
Error: Could not request certificate: Error 400 on SERVER: this master is not a CA

> 2) create a loan based on an earlier puppet patch. I'm assuming this would
> be like telling puppetize to not use tip or use a user puppet environment.
> Then test if slaves pass. If they do, apply one new patch at a time.

I tried this too, but it's tricky. I have a puppet env that points to a rev from before the bad day of AMI hell. To get a loan to start on this env, I hacked cloud-tools to download a custom version of puppetize:
diff --git a/cloudtools/aws/instance.py b/cloudtools/aws/instance.py
index 54172fa..5c9f56a 100644
--- a/cloudtools/aws/instance.py
+++ b/cloudtools/aws/instance.py
@@ -134,8 +134,7 @@ def assimilate_instance(instance, config, ssh_key, instance_data, deploypass,
         run_chroot('yum install -q -y puppet cloud-init wget')

     run_chroot("wget -O /root/puppetize.sh "
-               "https://hg.mozilla.org/build/puppet/"
-               "raw-file/production/modules/puppet/files/puppetize.sh")
+            "https://raw.githubusercontent.com/lundjordan/build-puppet/045738ae531c539ea32bdb9d3a6a2d75fb144dad/modules/puppet/files/puppetize.sh")
     run_chroot("chmod 755 /root/puppetize.sh")
     put(StringIO.StringIO(deploypass), "{}/root/deploypass".format(chroot))
     put(StringIO.StringIO("exit 0\n"),



that custom version of puppetize tells the slave to use my environment:


diff --git a/modules/puppet/files/puppetize.sh b/modules/puppet/files/puppetize.sh
index 622853c..56adcef 100644
--- a/modules/puppet/files/puppetize.sh
+++ b/modules/puppet/files/puppetize.sh
@@ -64,7 +64,7 @@ while true; do
     https_proxy= python <<EOF
 import urllib2, getpass
 deploypass="""$deploypass"""
-puppet_server="${PUPPET_SERVER:-puppet}"
+puppet_server=releng-puppet2.srv.releng.scl3.mozilla.com
 print "Contacting puppet server %s" % (puppet_server,)
 if not deploypass:
     deploypass = getpass.getpass('deploypass: ')
@@ -149,13 +149,13 @@ if $interactive; then
 fi

 run_puppet() {
-    puppet_server="${PUPPET_SERVER:-puppet}"
+    puppet_server=releng-puppet2.srv.releng.scl3.mozilla.com
     echo $"Running puppet agent against server '$puppet_server'"
     # this includes:
     # --pluginsync so that we download plugins on the first run, as they may be required
     # --ssldir=/var/lib/puppet/ssl because it defaults to /etc/puppet/ssl on OS X
     # FACTER_PUPPETIZING so that the manifests know this is a first run of puppet
-    PUPPET_OPTIONS="--onetime --no-daemonize --logdest=console --logdest=syslog --color=false --ssldir=/var/lib/puppet/ssl --pluginsync --detailed-exitcodes --server $puppet_server"
+    PUPPET_OPTIONS="--environment=jlund --onetime --no-daemonize --logdest=console --logdest=syslog --color=false --ssldir=/var/lib/puppet/ssl --pluginsync --detailed-exitcodes --server $puppet_server"
     export FACTER_PUPPETIZING=true

     # check for 'err:' in the output; this catches errors even

however, I have no idea if this will work. I'm currently creating a loan based on this now.

> 3) revert the known three puppet patches that are within the window of
> known-good-ami to known-bad-ami and let aws-manager2 publish new AMI's
> 4) do nothing and see if latest AMI works

I am also creating a vanilla loan to see if the latest puppet tip fares any better than last week.
As for #1, you just need to run ./puppetize.sh again to re-issue the certificate, and that should clean it up.
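
A minimal sketch of that re-run (hypothetical steps: it assumes puppetize.sh is still at /root/puppetize.sh on the host and that clearing the stale SSL dir from the failed attempt is acceptable on a loaner):

    # drop the half-issued key/cert from the failed run (assumption: safe on a loaner)
    rm -rf /var/lib/puppet/ssl
    # re-run puppetize against the desired master; it will prompt for the deploypass
    PUPPET_SERVER=releng-puppet2.srv.releng.scl3.mozilla.com sh /root/puppetize.sh
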

You mentioned that you now have a known-good host and a known-bad host.  Rather than struggle to get puppet to do different things, why not just diff those systems and selectively apply the diff?  Besides being hard to manipulate, puppet isn't going to give you iron-clad guarantees: reverting a puppet patch rarely undoes the actions of the patch; it's possible that some puppet code has nondeterministic behavior (although we try to catch that in review); and changes to apt repositories aren't versioned (were there repo changes during the critical period?).

I'd recommend diffing the list of installed packages and versions, first (dpkg --get-selections | grep -v deinstall).  If those match, then start diffing critical root directories -- /etc, /usr, and /tools, at least.  You can probably take a tarball of each dir from a known-good and known-bad system and then use diff -r on a third system with plenty of disk space.
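
For concreteness, a sketch of that first pass (hypothetical file names; run the first command on each host, then compare the two lists on a third machine):

    # on the known-good host, and again on the known-bad host:
    dpkg --get-selections | grep -v deinstall > /tmp/$(hostname).pkgs
    # copy both lists to one machine (say good.pkgs and bad.pkgs), then:
    diff -u good.pkgs bad.pkgs
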
> 
> I'd recommend diffing the list of installed packages and versions, first
> (dpkg --get-selections | grep -v deinstall).  If those match, then start
> diffing critical root directories -- /etc, /usr, and /tools, at least.  You
> can probably take a tarball of each dir from a known-good and known-bad
> system and then use diff -r on a third system with plenty of disk space.

Thanks for your input and help, Dustin. Rail did a diff similar to what you suggested and didn't find much. There are also timing changes that could be at play: e.g. newer puppet patches could cause a race condition where runner, puppet, and the X config conflict at startup and runner runs cleanslate and cleans something it's not supposed to.

I really just want to create an instance from either a known good AMI or a specific puppet rev and then apply puppet patches on it till my build breaks.

note: Friday is a stat holiday and I will not have computer access until Monday. Here is the current state of what I have tried; I can pick it up on Monday if not actioned:

my staging master: http://dev-master2.bb.releng.use1.mozilla.com:8038/builders/Ubuntu%20VM%2012.04%20mozilla-central%20opt%20test%20mochitest-3

my slaves:
  - tst-linux32-ec2-jlund2 - created last week and currently pinned to my puppet env (which backed out patches up to http://hg.mozilla.org/build/puppet/rev/0fdaa817c1e8). this machine is still hitting bug 1147853

  - tst-linux32-ec2-jlund3 - created via loan process against latest puppet tip from Wed

  - tst-linux32-ec2-jlund4 - tried re-creating this by hacking puppetize.sh and cloudtools/scripts/aws_create_instance to point to my puppet env. Looks like it didn't complete running puppet

  - tst-linux32-spot-001 - based off the known-good AMI. my next attempt here was to apply my puppet env starting at http://hg.mozilla.org/build/puppet/rev/0fdaa817c1e8 and then apply patches one at a time until we hit bug 1147853. However, it needs certs before you can run puppet and pin to a puppet env
   - to apply certs you can run puppetize.sh on 0001 and give the deploypass, or else run https://raw.githubusercontent.com/mozilla/build-puppet/master/modules/puppetmaster/templates/deployment_getcert.sh.erb as root on a distinguished puppet master (this will generate and echo certs)
Puppet can't cause behaviors other than by recording something on disk, so the answer is hidden in that diff somewhere.  I'll see if I can reproduce -- if I can convince tst-linux32-ec2-jlund to stop shutting down.  I'll also instantiate the latest AMI on a test host and capture an image of that, since it seems that all of the other hosts you've named have had some modification.
tst-linux32-ec2-jlund2 is also running puppet, which a spot instance shouldn't be doing, and which is applying patches subsequent to the failure.  And then it halts before I can figure out why (runner?).  So I don't really have a known-bad host to start with...

Can you get me a tarball of the root fs on a known-good and known-bad host?  Say, drop the tarballs on cruncher?  (they look to be about 6G each, probably less if you exclude some stuff under /builds)
coop: do we still have the disk images from the AMIs you removed? That might be a good place to start with a tarball of a "bad" instance.

Trying to puppetize machines and apply partial patches is probably not going to get us expected behavior since puppetizing a machine and then rolling back patches will not result in the same system as having never had those puppet rules applied at all. Since the puppet rules execute things on the host system (add packages, modify config files, etc), it's not like code patches where a reversion removes everything that the patch added.
Flags: needinfo?(coop)
(In reply to Amy Rich [:arich] [:arr] from comment #12)
> coop: do we still have the disk images from the AMI's you removed? That
> might be a good place to start with an tar ball of a "bad" instance.

I don't think so. De-registering the AMIs == deletion.

> Trying to puppetize machines and apply partial patches is probably not going
> to get us expected behavior since puppetizing a machine and then rolling
> back patches will not result in the same system as having never had those
> puppet rules applied at all. Since the puppet rules execute things on the
> host system (add packages, modify config files, etc), it's not like code
> patches where a reversion removes everything that the patch added.

Totally agree. To do this correctly, we need to generate an AMI at each revision and then test.
Flags: needinfo?(coop)
Actually, de-registering doesn't free the snapshots -- there's a sanity task that identifies such snapshots, though.

Through cloud-trail logs, I identified the deregister operations:

awsRegion us-east-1
eventID 89a43acb-e79f-45e4-98ce-9e5a532b8f21
eventName DeregisterImage
eventSource ec2.amazonaws.com
eventTime 2015-03-26T16:47:55Z
eventType AwsApiCall
eventVersion 1.02
recipientAccountId 314336048151
requestID 4006d37e-eaab-4e2e-b5bb-07e90610d23a
requestParameters {u'imageId': u'ami-7ec2e916'}
responseElements {u'_return': True}
sourceIPAddress 76.10.141.31
userAgent console.ec2.amazonaws.com
----
awsRegion us-east-1
eventID 58aeea4c-32f0-451e-9f36-77bdf9533d63
eventName DeregisterImage
eventSource ec2.amazonaws.com
eventTime 2015-03-26T16:47:55Z
eventType AwsApiCall
eventVersion 1.02
recipientAccountId 314336048151
requestID 34be3850-c3b7-44de-95bb-0758f9470611
requestParameters {u'imageId': u'ami-2ec1ea46'}
responseElements {u'_return': True}
sourceIPAddress 76.10.141.31
userAgent console.ec2.amazonaws.com
----
awsRegion us-east-1
eventID fe6f3f87-4393-4a4b-8767-4a79c63e7f87
eventName DeregisterImage
eventSource ec2.amazonaws.com
eventTime 2015-03-26T16:47:55Z
eventType AwsApiCall
eventVersion 1.02
recipientAccountId 314336048151
requestID 123ac28c-908f-4af8-b0bd-59fae82143ee
requestParameters {u'imageId': u'ami-f0c7ec98'}
responseElements {u'_return': True}
sourceIPAddress 76.10.141.31
userAgent console.ec2.amazonaws.com

unfortunately, those don't give a snapshot id.  There aren't a lot of snapshots in place, so I'm guessing that they have been deleted.  Still, we have a known-good AMI that's running right now, and we have known-bad hosts, so we should be able to do a diff.  So back to my request in comment 11..
Flags: needinfo?(jlund)
> unfortunately, those don't give a snapshot id.  There aren't a lot of
> snapshots in place, so I'm guessing that they have been deleted.  Still, we
> have a known-good AMI that's running right now, and we have known-bad hosts,
> so we should be able to do a diff.  So back to my request in comment 11..

okay, I'm creating a fresh known-bad host now and will produce some diffs off of it
Flags: needinfo?(jlund)
I have yet to get a tar without:
  tar: /: file changed as we read it
  tar: Exiting with failure status due to previous errors

I tried to exclude a number of mutating files and ended up with:
   tar -cvpzf linux32-ec2-jlund3.tar.gz --exclude=/linux32-ec2-jlund3.tar.gz --exclude=/proc --exclude=/lost+found --exclude=/mnt --exclude=/sys --exclude=/media --exclude=/dev --one-file-system /

at any rate, I think most things should be unpacked and read/diffed.

I copied them to cruncher.srv.releng.scl3:
  /home/buildduty/jlund/linux32-ec2-jlund3.tar.gz <- bad loan slave
  /home/buildduty/jlund/linux32-spot-003.tar.gz <- good ami slave

dustin, could you poke/diff these archives for any anomalies that seem suspect to you?
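
A rough sketch of the diff pass that follows (assuming the two tarballs above are unpacked into good/ and bad/ on a host with enough disk):

    mkdir good bad
    sudo tar -xzpf linux32-spot-003.tar.gz -C good     # good ami slave
    sudo tar -xzpf linux32-ec2-jlund3.tar.gz -C bad    # bad loan slave
    # start with the critical roots: /etc first, then /usr and /tools
    sudo diff -ur bad/etc good/etc > etc.diff
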
Assignee: nobody → dustin
OK, I started with /etc.  It looks like puppet was installed manually on bad, while the "fake puppet" used during AMI creation is in place on good.  Interesting bits:

** Puppet's locale setting is not in place on good.  This might be something cloud-init futzes with.

diff -ur bad/etc/default/locale good/etc/default/locale
--- bad/etc/default/locale      2015-04-02 06:19:03.000000000 +0000
+++ good/etc/default/locale     2015-04-07 14:14:30.000000000 +0000
@@ -1,2 +1 @@
 LANG="en_US.UTF-8"
-LANGUAGE="en_US:en"

** Apache seems to only be installed on good:

Only in good/etc/apache2/sites-enabled: 000-default
Only in good/etc/cron.daily: apache2

** Sudoers.d only exists on good, but it's empty so probably not relevant:

Only in good/etc: sudoers.d
> [ec2-user@ip-10-132-57-54 ephemeral0]$ sudo chroot good dpkg-query -l | sort > good.pkgs
> [ec2-user@ip-10-132-57-54 ephemeral0]$ sudo chroot bad dpkg-query -l | sort > bad.pkgs
> [ec2-user@ip-10-132-57-54 ephemeral0]$ diff -u {bad,g^C
> [ec2-user@ip-10-132-57-54 ephemeral0]$ diff -u {bad,good}.pkgs
> --- bad.pkgs  2015-04-07 22:00:13.000000000 +0000
> +++ good.pkgs 2015-04-07 22:00:11.000000000 +0000
> @@ -262,7 +262,7 @@
>  ii  gnome-session-bin                      3.2.1-0ubuntu8                          GNOME Session Manager - Minimal runtime
>  ii  gnome-session-canberra                 0.28-3ubuntu3                           GNOME session log in and log out sound events
>  ii  gnome-session-common                   3.2.1-0ubuntu8                          GNOME Session Manager - common files
> -ii  gnome-settings-daemon                  3.4.1-0ubuntu1                          daemon handling the GNOME session settings
> +ii  gnome-settings-daemon                  3.4.2-0ubuntu0.6.2                      daemon handling the GNOME session settings
>  ii  gnome-sudoku                           1:3.4.1-0ubuntu1                        Sudoku puzzle game for GNOME
>  ii  gnome-system-log                       3.4.1-0ubuntu1                          system log viewer for GNOME
>  ii  gnome-system-monitor                   3.4.1-0ubuntu1                          Process viewer and system resource monitor for GNOME
> @@ -1095,7 +1095,7 @@
>  ii  media-player-info                      16-1                                    Media player identification files
>  ii  metacity                               1:2.34.1-1ubuntu11                      lightweight GTK+ window manager
>  ii  metacity-common                        1:2.34.1-1ubuntu11                      shared files for the Metacity window manager
> -ii  mig-agent                              20150330+e3f41a6.prod                   Mozilla InvestiGator Agent
> +ii  mig-agent                              20150122+ad43a11.prod                   Mozilla InvestiGator Agent
>  ii  mime-support                           3.51-1ubuntu1                           MIME files 'mime.types' & 'mailcap', and support programs
>  ii  mobile-broadband-provider-info         20120410-0ubuntu1                       database of mobile broadband service providers
>  ii  modemmanager                           0.5.2.0-0ubuntu2                        D-Bus service for managing modems
> @@ -1370,7 +1370,7 @@
>  ii  ttf-sazanami-mincho                    20040629-8ubuntu1                       Sazanami Mincho Japanese TrueType font
>  ii  ttf-ubuntu-font-family                 0.80-0ubuntu2                           Ubuntu Font Family, sans-serif typeface hinted for clarity
>  ii  ttf-wqy-microhei                       0.2.0-beta-1ubuntu1                     A droid derived Sans-Seri style CJK font
> -ii  tzdata                                 2012e-0ubuntu0.12.04.1                  time zone and daylight-saving time data
> +ii  tzdata                                 2014a-0ubuntu0.12.04                    time zone and daylight-saving time data
>  ii  ubuntu-artwork                         57                                      Ubuntu themes and artwork
>  ii  ubuntu-desktop                         1.267                                   The Ubuntu desktop system
>  ii  ubuntu-docs                            12.04.4                                 Ubuntu Desktop Guide
> @@ -1427,7 +1427,7 @@
>  ii  usbmuxd                                1.0.7-2                                 USB multiplexor daemon for iPhone and iPod Touch devices
>  ii  usbutils                               1:005-1                                 Linux USB utilities
>  ii  util-linux                             2.20.1-1ubuntu3                         Miscellaneous system utilities
> -ii  v4l2loopback-dkms                      0.4.1-1                                 Source for the v4l2loopback driver (DKMS)
> +ii  v4l2loopback-dkms                      0.6.1-1                                 Source for the v4l2loopback driver (DKMS)
>  ii  vbetool                                1.1-2ubuntu1                            run real-mode video BIOS code to alter hardware state
>  ii  vim                                    2:7.3.429-2ubuntu2                      Vi IMproved - enhanced vi editor
>  ii  vim-common                             2:7.3.429-2ubuntu2                      Vi IMproved - Common files

That is certainly surprising (aside from mig, which was an intentional upgrade).  Taking one example:

/data/repos/apt/releng-updates/pool/universe/v/v4l2loopback/v4l2loopback-dkms_0.6.1-1_all.deb
/data/repos/apt/ubuntu/pool/universe/v/v4l2loopback/v4l2loopback-dkms_0.4.1-1_all.deb

that's installed with ensure => latest, so somehow releng-updates wasn't in scope when bad ran puppet.  The pattern repeats for tzdata and gnome-settings-daemon.  I'm not sure if this is an artifact of the creation of the good and bad images, or a clue.  Jordan, do you have an idea why that might be the case?
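
One quick way to check, on a live host, which repo apt would pull a given package from (a sketch of by-hand commands, not something captured here):

    # show installed vs. candidate versions and which repo each comes from
    apt-cache policy v4l2loopback-dkms tzdata gnome-settings-daemon
    # confirm releng-updates is actually configured as an apt source
    grep -rn releng-updates /etc/apt/sources.list /etc/apt/sources.list.d/
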

I'm tempted to try generating an AMI, let it start a few hosts, grab a copy, and then verify that the newly-generated hosts fail the same way and disable the new AMI.  That way we'd (a) have a known-bad AMI that was generated using the same process as the known-good AMIs and (b) verify that this is still failing.
so I'm looking at recent changes to our puppet repos. I see there is a change on March 25th (which lines up with the day between the good-AMI and bad-AMI creation windows[1]). following this change leads me to:

  - http://puppetagain.pub.build.mozilla.org/data/repos/apt/releng-updates/dists/precise-updates/all/binary-i386/
  - http://puppetagain.pub.build.mozilla.org/data/repos/apt/releng.tar.gz

reading the docs on how to add ubuntu packages[2] suggests that we do something like:
> for arch in i386 amd64; do
>   for dist in precise trusty; do
>     mkdir -p dists/${dist}/all/binary-$arch
>     dpkg-scanpackages --multiversion --arch $arch pool/$dist > dists/${dist}/all/binary-$arch/Packages
>     bzip2 < dists/${dist}/all/binary-$arch/Packages > dists/${dist}/all/binary-$arch/Packages.bz2
>   done
> done

the above commands suggest to me that dists/precise-updates/all/binary-i386/Packages and dists/precise-updates/all/binary-i386/Packages.bz2 should not be empty. You can see the equivalent files are not empty here[3] and here[4]
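
A quick sanity check against the public mirror (the exact Packages paths are an assumption based on the directory listings above):

    # a healthy index should report many package stanzas; an empty or missing one
    # would explain why the bad host never saw the releng-updates versions
    curl -s http://puppetagain.pub.build.mozilla.org/data/repos/apt/releng-updates/dists/precise-updates/all/binary-i386/Packages | grep -c '^Package:'
    curl -s http://puppetagain.pub.build.mozilla.org/data/repos/apt/releng/dists/precise/main/binary-i386/Packages | grep -c '^Package:'
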

dustin: could this and whatever was done that day be playing a part here?

[1] https://bugzilla.mozilla.org/show_bug.cgi?id=1147853#c0
[2] https://wiki.mozilla.org/ReleaseEngineering/PuppetAgain/Packages#Ubuntu:_Adding_New_Packages
[3] http://puppetagain.pub.build.mozilla.org/data/repos/apt/ubuntu/dists/precise/main/binary-i386/
[4] http://puppetagain.pub.build.mozilla.org/data/repos/apt/releng/dists/precise/main/binary-i386/
(In reply to Dustin J. Mitchell [:dustin] from comment #18)
> > [dpkg package diff snipped -- identical to the listing in comment 18 above]
> 
> That is certainly surprising (aside from mig, which was an intentional
> upgrade).  Taking one example:
> 
> /data/repos/apt/releng-updates/pool/universe/v/v4l2loopback/v4l2loopback-
> dkms_0.6.1-1_all.deb
> /data/repos/apt/ubuntu/pool/universe/v/v4l2loopback/v4l2loopback-dkms_0.4.1-
> 1_all.deb
> 
> that's installed with ensure => latest, so somehow releng-updates wasn't in
> scope when bad ran puppet.  The pattern repeats for tzdata and
> gnome-settings-daemon.  I'm not sure if this is an artifact of the creation
> of the good and bad images, or a clue.  Jordan, do you have an idea why that
> might be the case?

I re-scanned the puppet changes and I cannot see a patch that would be playing a part in that. Maybe non-AMI creations (loans via puppetize) have some magic that skips releng-updates. Or maybe comment 19 ^ is the culprit.

> I'm tempted to try generating an AMI, let it start a few hosts, grab a copy,
> and then verify that the newly-generated hosts fail the same way and disable
> the new AMI.  That way we'd (a) have a known-bad AMI that was generated
> using the same process as the known-good AMIs and (b) verify that this is
> still failing.

I think, at this rate, if my above findings yield nothing, this is a good plan and worth the cost of some bustage.
Comment 19 would correspond to bug 1131611 comment 26.  Ouch.  The script said dpkg-scanpackages wasn't found; without really thinking I assumed that meant it did nothing.

Let me see if I can regenerate that metadata.
OK, I wrote a modern update.sh script that runs on CentOS and dropped it in /data/repos/apt/releng-updates, then ran it.

dmitchell@releng-puppet2 /data/repos/apt/releng-updates $ sudo sh update.sh
dpkg-scanpackages: info: Wrote 8 entries to output Packages file.
dpkg-scanpackages: info: Wrote 8 entries to output Packages file.

so we should be in good shape now.  I'd like to start a new AMI generation, but I think we should get the fix for bug 1152240 in first.
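
For the record, a minimal sketch of what such an update.sh might look like (modeled on the wiki loop quoted in comment 19 and adjusted for the precise-updates layout; the actual script and pool path may differ):

    #!/bin/sh
    # regenerate apt metadata for the releng-updates repo
    cd /data/repos/apt/releng-updates
    dist=precise-updates
    for arch in i386 amd64; do
        mkdir -p dists/${dist}/all/binary-${arch}
        dpkg-scanpackages --multiversion --arch ${arch} pool > dists/${dist}/all/binary-${arch}/Packages
        bzip2 < dists/${dist}/all/binary-${arch}/Packages > dists/${dist}/all/binary-${arch}/Packages.bz2
    done
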
Oh, and I deleted releng-update.sh and releng-update.conf so this won't happen again.
Creating new AMIs for this and bug 1152240.
Ugh, and because aws_publish_amis.py was still disabled, these all suicided.  I'm re-enabling all of the crontasks disabled in attachment 8586156 [details] [diff] [review].
Flags: needinfo?(catlee)
OK, the new AMIs are failing, too:

ami-6ebd9506 = good (from March 25)
ami-38665a50 = bad (from today)

I'll re-land attachment 8586156 [details] [diff] [review].
It wrote those 8 entries, but to dists/precise, not dists/precise-updates.  Fixed now.
re-spun new AMIs and published them. There was a hiccup in doing so, but I will file a follow-up bug in cloud-tools. Confirmed that bug 1147853 looks to be resolved.

I think all that is left to do here is to undo the puppet patch that disabled the aws-manager cron tasks?
I think this is the only patch we need to revert (undoing the re-landing of the disable patch ;))
Attachment #8590012 - Flags: review?(dustin)
See Also: → 1152624
Comment on attachment 8590012 [details] [diff] [review]
150408_1149580_disable_ami_gen_undo-cloud-tools.patch

Looks like a much better re-landing than my buggy version earlier today.
Attachment #8590012 - Flags: review?(dustin) → review+
Comment on attachment 8590012 [details] [diff] [review]
150408_1149580_disable_ami_gen_undo-cloud-tools.patch

thanks!

merged and pushed to prod: https://hg.mozilla.org/build/puppet/rev/af9002b32c12
Attachment #8590012 - Flags: checked-in+
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Component: General Automation → General