Closed Bug 1094293 Opened 10 years ago Closed 9 years ago

Add initial Yosemite(10.10) slaves to releng CI

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task, P2)

x86_64
macOS

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: coop, Assigned: kmoir)

References

Details

(Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/4223] [10.10][mac])

Attachments

(4 files, 4 obsolete files)

We'll have 3 Yosemite slaves available soon (bug 1094279) that we can use to start greening up testing on 10.10. They'll need to be added to releng CI in the standard places:

* slavealloc
* buildbot-configs
* buildbotcustom?
* mozharness?
* tools
* slaveapi
* relengapi
* slave health
* graphserver
* treeherder
Machine entries exist in graphserver (staging and prod) and in slavealloc.

Puppet work was covered by bug 1078882.
Assignee: nobody → coop
Status: NEW → ASSIGNED
Priority: -- → P2
Most of the work is happening over in bug 1094288 because it's essentially a s/mavericks/yosemite/g operation.
See Also: → 1094288
Attached patch Pass env to ScriptFactory steps (obsolete) — Splinter Review
The problem I'm trying to solve here is that /usr/local/bin is not in the default PATH on Yosemite, while it is in PATH on Snow Leopard/Lion/Mountain Lion/Mavericks. hg, wget, and the screenresolution script are in /usr/local/bin, so having it in the PATH is kinda important.

In this patch, I add a standard PATH for all Mac machines. I then explicitly set the env of steps that need access to hg, wget, or screenresolution.

I haven't done exhaustive testing, but I have tested Yosemite, w864, and talos-linux32 in staging to make sure the change works and doesn't affect other builds. Essentially, if a platform doesn't have an env already set, nothing changes.

The change to clone_buildtools in MozillaBuildFactory has a small chance of breaking things, but again, if the env wasn't set before, nothing changes now.
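For context, wiring an explicit env into a step looks roughly like this in buildbot 0.8 (a minimal, self-contained sketch; the real MozillaBuildFactory plumbing is more involved, and the PATH value here is illustrative):

    from buildbot.process.factory import BuildFactory
    from buildbot.steps.shell import ShellCommand

    # /usr/local/bin is where hg, wget, and screenresolution live on the
    # Mac slaves, so it has to be on the PATH we hand the step.
    mac_env = {"PATH": "/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin"}

    factory = BuildFactory()
    factory.addStep(ShellCommand(
        name="check_hg",
        command=["hg", "--version"],
        env=mac_env,  # explicit env instead of whatever launchd hands the slave
    ))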
Attachment #8529884 - Flags: review?(jlund)
Comment on attachment 8529884 [details] [diff] [review]
Pass env to ScriptFactory steps

Review of attachment 8529884 [details] [diff] [review]:
-----------------------------------------------------------------

::: process/factory.py
@@ +492,5 @@
>          self.addStep(MercurialCloneCommand(
>                       name='clone_buildtools',
>                       command=['hg', 'clone', self.buildToolsRepo, 'tools'],
>                       description=['clone', 'build tools'],
>                       log_eval_func=rc_eval_func({0: SUCCESS, None: RETRY}),

you mentioned this step may fail. After today's disaster, we should probably change this line so we don't blanket RETRY like we did for mozharness :)
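(For illustration, a self-contained sketch of that suggestion; the constants mirror buildbot 0.8's numeric build statuses, and eval_rc is a stand-in for rc_eval_func's lookup, not the real buildbotcustom code:)

    SUCCESS, WARNINGS, FAILURE, SKIPPED, EXCEPTION, RETRY = range(6)

    def eval_rc(status_map, rc, default=FAILURE):
        # exact return code first, then the catch-all None entry
        if rc in status_map:
            return status_map[rc]
        return status_map.get(None, default)

    # blanket retry (current patch) vs. fail-on-unknown (the suggestion)
    assert eval_rc({0: SUCCESS, None: RETRY}, 1) == RETRY
    assert eval_rc({0: SUCCESS, None: FAILURE}, 1) == FAILURE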
Comment on attachment 8529884 [details] [diff] [review]
Pass env to ScriptFactory steps

Review of attachment 8529884 [details] [diff] [review]:
-----------------------------------------------------------------

looks good. IIUC, env.py will only affect test jobs. However, it looks like you are adding self.env to an hg MozillaBuildFactory step (I'm guessing because MozillaTestFactory won't die). So this will affect build jobs too. I *think* that is okay since it looks like we define platforms[pf]['env'] everywhere correctly in mozilla/config.py.

I think we should block on waiting for a https://bugzilla.mozilla.org/show_bug.cgi?id=1094293#c5 solution before landing this though

::: process/factory.py
@@ +4872,5 @@
>                  ))
>              self.addPeriodicRebootSteps()
>  
>  
> +def resolution_step(env={}):

I think this is safe in this situation but I try to never use mutable default arguments: http://docs.python-guide.org/en/latest/writing/gotchas/
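(For anyone unfamiliar with the gotcha: the default value is built once, at def time, so a mutable default is shared across every call. A quick illustration, borrowing the function name from the patch; the PATH value is made up:)

    def resolution_step_bad(env={}):
        # this {} is created once and reused by every call
        env.setdefault("PATH", "/usr/local/bin:/usr/bin:/bin")
        return env

    shared = resolution_step_bad()
    shared["LEAKED"] = "1"
    assert "LEAKED" in resolution_step_bad()  # the mutation persists!

    def resolution_step_good(env=None):
        # None default plus a copy: every call gets a fresh dict
        env = dict(env or {})
        env.setdefault("PATH", "/usr/local/bin:/usr/bin:/bin")
        return env

    assert "LEAKED" not in resolution_step_good()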
Attachment #8529884 - Flags: review?(jlund) → review+
Comment on attachment 8529884 [details] [diff] [review]
Pass env to ScriptFactory steps

Review of attachment 8529884 [details] [diff] [review]:
-----------------------------------------------------------------

(In reply to Jordan Lund (:jlund) from comment #6) 
> I think we should block on waiting for a
> https://bugzilla.mozilla.org/show_bug.cgi?id=1094293#c5 solution before
> landing this though

I agree it should be fixed, but blocking my patch on it doesn't make sense. My patch isn't changing the existing behavior here, and catlee's work in bug 733663 should fix this globally.

> I think this is safe in this situation but I try to never use mutable
> default arguments: http://docs.python-guide.org/en/latest/writing/gotchas/

Huh, that's very useful. /me commits those to memory.

https://hg.mozilla.org/build/buildbotcustom/rev/7eef3d9293d2
Attachment #8529884 - Flags: checked-in+
Comment on attachment 8529884 [details] [diff] [review]
Pass env to ScriptFactory steps

Review of attachment 8529884 [details] [diff] [review]:
-----------------------------------------------------------------

Backed out for build bustage. :(
Attachment #8529884 - Flags: checked-in+ → checked-in-
Here I thought the puppet part was done (see comment #1).

This patch gets cltbld set up with a proper shell on 10.10, complete with PATH.

I just ran some tests in staging and they ran fine, properly inheriting the PATH from the OS.
Attachment #8529884 - Attachment is obsolete: true
Attachment #8530394 - Flags: review?(jlund)
Comment on attachment 8530394 [details] [diff] [review]
[puppet] Setup user properly on 10.10

Review of attachment 8530394 [details] [diff] [review]:
-----------------------------------------------------------------

Oh cool!

much simpler indeed. I think I understand what is going on and this seems much safer as it should only affect yosemite.

Though, since I'm not a puppet expert, just want to verify the testing. By 'ran some tests in staging', does that mean you pinned your puppet environment to a yosemite slave?

I poked around macosx_productversion_major and I noticed a few places where we haven't defined a 10.10 case yet. I think I can understand why we haven't with most, but just wanted to sanity-check this one: http://mxr.mozilla.org/build/source/puppet/modules/disableservices/manifests/common.pp#58
Attachment #8530394 - Flags: review?(jlund) → review+
(In reply to Jordan Lund (:jlund) from comment #11)
> I poked around macosx_productversion_major and I noticed a few places where
> we haven't defined 10.10 case yet. I think I can understand why we haven't
> with most but just wanted to sanity check this one:
> http://mxr.mozilla.org/build/source/puppet/modules/disableservices/manifests/
> common.pp#58

It's quite possible we'll need a block like http://mxr.mozilla.org/build/source/puppet/modules/disableservices/manifests/common.pp#71 soon. What Kim discovered when she was setting up 10.9 was that Apple had changed many of the services, which is why that block is so different from the one for 10.8 and earlier.

I'll start digging into the services part. This initial patch gets the tests themselves unblocked.
Comment on attachment 8530394 [details] [diff] [review]
[puppet] Setup user properly on 10.10

Review of attachment 8530394 [details] [diff] [review]:
-----------------------------------------------------------------

http://hg.mozilla.org/build/puppet/rev/c13ab9c78a85
Attachment #8530394 - Flags: checked-in+
(In reply to Chris Cooper [:coop] from comment #13)
> Comment on attachment 8530394 [details] [diff] [review]
> [puppet] Setup user properly on 10.10
> 
> Review of attachment 8530394 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> http://hg.mozilla.org/build/puppet/rev/c13ab9c78a85

something is not right: https://tbpl.mozilla.org/?tree=Cedar&rev=44e61e86879c

had ~4k jobs spun up from cloning mozharness again between our 2 yosemite slaves. Not sure why it failed, but after the week we have had, we should address this RETRY-when-hg-fails issue in all places.

IIUC philor stopped the bleeding by disabling yosemite slaves via slavealloc.
A couple of things happened here.

I reimaged the two yosemite nodes before returning them to production. The puppet change got merged to production, but there is still a random element to when that goes live, as determined by:

/bin/sh -c sleep $((($RANDOM/(32768/200)))); /etc/puppet/update.sh

I should have forced an update by hand, but did not. :/
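(For reference, $RANDOM is uniform over 0..32767 and 32768/200 is 163 in integer math, so that line staggers puppet updates over roughly 0-200 seconds so the slaves don't all hit the master at once. A rough Python equivalent, purely illustrative:)

    import random
    import subprocess
    import time

    delay = random.randint(0, 32767) // (32768 // 200)  # ~0-200 seconds
    time.sleep(delay)
    subprocess.check_call(["/etc/puppet/update.sh"])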

Obviously the auto-retry is a big issue too. Because the build never manages to clone tools, it can't even run count_and_reboot.py to pull in the puppet changes that would have stopped the bleeding. This is one of the reasons why we *need* that shared tools checkout on each slave.
OK, there's more failing here than I think I can debug on my own.

I'm re-opening bug 1078882 to get help with things that should be working.
Depends on: 1078882
Assignee: coop → kmoir
Seeing:
Upon execvpe hg ['hg', 'clone', 'https://hg.mozilla.org/build/mozharness', 'scripts'] in environment id 140423192525424
:Traceback (most recent call last):
  File "/tools/buildbot-0.8.4-pre-moz4/lib/python2.7/site-packages/twisted/internet/process.py", line 414, in _fork
    executable, args, environment)
  File "/tools/buildbot-0.8.4-pre-moz4/lib/python2.7/site-packages/twisted/internet/process.py", line 460, in _execChild
    os.execvpe(executable, args, environment)
  File "/tools/buildbot-0.8.4-pre-moz4/lib/python2.7/os.py", line 353, in execvpe
    _execvpe(file, args, env)
  File "/tools/buildbot-0.8.4-pre-moz4/lib/python2.7/os.py", line 380, in _execvpe
    func(fullname, *argrest)
OSError: [Errno 2] No such file or directory
program finished with exit code 1

I wonder if this may have just been a problem with the old hg we tried to install..

Note that it seems to work fine with the version I built and deployed:

[cltbld@t-yosemite-r5-0002.test.releng.scl3.mozilla.com ~]$ /tools/python27-mercurial/bin/hg --version
Mercurial Distributed SCM (version 3.2.1)
(see http://mercurial.selenic.com for more information)

Copyright (C) 2005-2014 Matt Mackall and others
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
[cltbld@t-yosemite-r5-0002.test.releng.scl3.mozilla.com ~]$ /tools/python27_mercurial/bin/hg --version
Mercurial Distributed SCM (version 3.2.1)
(see http://mercurial.selenic.com for more information)

Copyright (C) 2005-2014 Matt Mackall and others
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
[cltbld@t-yosemite-r5-0002.test.releng.scl3.mozilla.com ~]$ ls
Applications    Desktop         Downloads       Library         Movies          Music           Pictures
[cltbld@t-yosemite-r5-0002.test.releng.scl3.mozilla.com ~]$ /tools/python27_mercurial/bin/hg clone https://hg.mozilla.org/build/mozharness
warning: hg.mozilla.org certificate with fingerprint af:27:b9:34:47:4e:e5:98:01:f6:83:2b:51:c9:aa:d8:df:fb:1a:27 not verified (check hostfingerprints or web.cacerts config setting)
destination directory: mozharness
requesting all changes
adding changesets
adding manifests
adding file changes
added 3460 changesets with 6124 changes to 528 files (+1 heads)
updating to branch default
332 files updated, 0 files merged, 0 files removed, 0 files unresolved
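(For what it's worth, the execvpe OSError earlier in this comment is exactly what you get when the PATH handed to the child omits the directory holding hg. A minimal standalone reproduction, with an illustrative PATH:)

    import subprocess

    try:
        # hg lives in /usr/local/bin on these slaves; a PATH without it
        # fails the same way the buildbot step did
        subprocess.check_call(["hg", "--version"],
                              env={"PATH": "/usr/bin:/bin"})
    except OSError as e:
        print(e)  # [Errno 2] No such file or directory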
Whiteboard: [10.10][mac] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/4223] [10.10][mac]
I have green results for some opt jobs on my master but the talos tests are timing out, investigating
(In reply to Kim Moir [:kmoir] from comment #18)
> I have green results for some opt jobs on my master but the talos tests
> are timing out, investigating

Is the addition "Require all granted" to talos.conf in place? See https://bugzilla.mozilla.org/show_bug.cgi?id=1078882#c30.
Thanks coop. I fixed that and it addressed the issue. I have a patch to fix this in puppet.
Attached patch bug1094293apache.patch (obsolete) — Splinter Review
Add "Require all granted" to apache config on 10.10 to talos tests run
Attachment #8536643 - Flags: review?(dustin)
Comment on attachment 8536643 [details] [diff] [review]
bug1094293apache.patch

Just an observational question,

Is there a reason we don't just use a template conditional on OS X 10.10 in the .conf rather than forking the .conf wholesale?

(If this is the only change, I imagine it will be less mental complexity.)
Attachment #8536643 - Flags: review?(dustin)
Comment on attachment 8536643 [details] [diff] [review]
bug1094293apache.patch

Agree with Callek - these should be the same template, just with conditionals on the OS.
Attachment #8536643 - Flags: review-
Attached patch bug1094293apache-2.patch (obsolete) — Splinter Review
Attachment #8536643 - Attachment is obsolete: true
Attachment #8536684 - Flags: review?(dustin)
This patch:
* fixes the apache issue for talos tests
* disables several services that were already disabled on 10.9, so update notifications etc. don't appear
* adds a symlink to /usr/local/bin/hg because it was missing
* installs xcode

coop or dustin: either of you is welcome to review; it would be nice to get 10.10 tests running on cedar before the holidays

I ran tests in staging with these changes overnight and the tests aren't all green. However, the orange seems to be valid test failures, not infrastructure related.
Attachment #8538668 - Flags: review?(dustin)
Attachment #8538668 - Flags: review?(coop)
Attachment #8536684 - Attachment is obsolete: true
Attachment #8536684 - Flags: review?(dustin)
Comment on attachment 8538668 [details] [diff] [review]
bug1094293puppet-2.patch

Review of attachment 8538668 [details] [diff] [review]:
-----------------------------------------------------------------

The 'hg' symlink still feels gross to me - like we're papering over something we don't understand.  Are the PATH differences going to cause other errors?  Will this change (which isn't conditionalized on OS version) cause issues on other OS X versions?

I don't have a better idea.. just a feeling :/

r- for the really minor non-touchy-feely stuff below.

::: modules/disableservices/manifests/common.pp
@@ +67,5 @@
>                              enable => false,
>                              ensure => stopped,
>                      }
>                  }
> +                10.9, 10.10: {

Oops!!

::: modules/packages/manifests/mozilla/py27_mercurial.pp
@@ +58,5 @@
> +                ["/usr/bin/hg"]:
> +                    ensure => link,
> +                    owner => "root",
> +                    replace => "no",
> +                    group => "$users::root::group",

No quotes here

::: modules/talos/manifests/init.pp
@@ +55,2 @@
>                      include packages::javadeveloper_for_os_x
>                      # not sure why this is required, but it appears to be

Can you update this comment with the reason?

::: modules/talos/manifests/settings.pp
@@ +5,5 @@
>  
>  class talos::settings {
>      # public variables used by talos module
>      $apachedocumentroot = "/builds/slave/talos-slave/talos-data/talos"
> +    $requireall =  $::macosx_productversion_major ? {

This is a good fix, but worth a comment.
Attachment #8538668 - Flags: review?(dustin) → review-
Comment on attachment 8538668 [details] [diff] [review]
bug1094293puppet-2.patch

Review of attachment 8538668 [details] [diff] [review]:
-----------------------------------------------------------------

Nothing to add from Dustin's comments.
Attachment #8538668 - Flags: review?(coop)
I don't understand why this is an oops:
>                              enable => false,
>                              ensure => stopped,
>                      }
>                  }
> +                10.9, 10.10: {

I left the 10.9 stuff there because my understanding is that some of the signing servers are 10.9

I don't like the hg fix either, but I have changed it to be limited to the OS version and to point directly at the binary. Running some tests again now to make sure nothing broke.
Depends on: 1113484
Attachment #8538763 - Flags: checked-in+
Attached patch bug1094293puppet-4.patch (obsolete) — Splinter Review
Again, this links to binaries in /usr/local/bin because the plist isn't adding /usr/local/bin to PATH in the environment. I hope to have a better fix, but I don't have one right now.

Current status of tests in staging on cedar
* all talos tests are green
* opt tests have a lot of orange that developers will need to look at once we enable them
* Builder Rev5 MacOSX Yosemite 10.10 cedar opt test webapprt-chrome is failing with an error I'm investigating
* Builder Rev5 MacOSX Yosemite 10.10 cedar opt test jetpack is failing but is doing the same on production branches today
Attachment #8539310 - Flags: review?(dustin)
Comment on attachment 8539310 [details] [diff] [review]
bug1094293puppet-4.patch

Can't plist files add to PATH? (For the stuff they run)
That is the problem. There is a bug in 10.10 with plists adding to PATH.
The failure message for webapprt-chrome seems to be something to do with the test path. Apparently this test suite is only enabled on cedar anyway, so I'm not going to worry about it that much.

So I think the steps forward after this last patch is approved and landed are
*reimage yosemite slaves
*attach them to a mac test master
*try to run tests on cedar
*ask developers to look at all the orange

I don't think anything needs to be done on the buildbot configs side to enable them on cedar since this worked out of the box on my test dev master.
Comment on attachment 8539310 [details] [diff] [review]
bug1094293puppet-4.patch

If it was just hg, I was OK with a temporary workaround until the PATH issue could be resolved, but this is getting pretty bad :(

At the very least, these hacks should all be in the packages::mozilla::xx class (here, the screenresolution hack isn't), and flagged with comments referencing this bug to indicate that they're a temporary workaround.

Morgan, is there a way we could set the path in runner, instead of in the plist?  Kim, do you have links or anything like that about the PATH bug?
Flags: needinfo?(winter2718)
Attachment #8539310 - Flags: review?(dustin) → review-
Yes, you can modify the environment of runner tasks by adding an entry to runner.cfg under the [env] heading. See here: http://hg.mozilla.org/build/puppet/file/111e26506b74/modules/runner/templates/runner.cfg.erb#l12

Just add an entry like:

[env]
PATH=${PATH}:/home/yadda/yadda/bin
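(For context, the [env] section amounts to overlaying those entries onto the task environment before runner spawns each task. A hedged sketch of the idea; names and parsing details are illustrative, not runner's actual code:)

    import os
    try:
        from configparser import RawConfigParser  # Python 3
    except ImportError:
        from ConfigParser import RawConfigParser  # Python 2, runner's era

    cfg = RawConfigParser()
    cfg.optionxform = str  # keep PATH upper-case; parsers lower-case keys by default
    cfg.read("runner.cfg")

    env = dict(os.environ)
    if cfg.has_section("env"):
        for key, value in cfg.items("env"):
            # ${PATH}-style references expand against the real environment
            env[key] = os.path.expandvars(value)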
Flags: needinfo?(winter2718)
patch to add /usr/local/bin to path in runner and remove the symlink to hg

Morgan: thanks for the advice

Dustin: Here's a link regarding the 10.10 issue http://apple.stackexchange.com/questions/106355/setting-the-system-wide-path-environment-variable-in-mavericks

I tested this patch in staging and the results are green so far
Attachment #8539310 - Attachment is obsolete: true
Attachment #8544718 - Flags: review?(dustin)
Comment on attachment 8544718 [details] [diff] [review]
bug1094293-5.patch

Review of attachment 8544718 [details] [diff] [review]:
-----------------------------------------------------------------

Interesting -- that's talking about /etc/launchd.conf, not the EnvironmentVariables plist key, which is still in the launchd.plist(5) manpage on 10.10 (side-note, apparently Apple quit updating their online manpages, as they're still for 10.9)

     EnvironmentVariables <dictionary of strings>

          This optional key is used to specify additional environmental
          variables to be set before running the job. Each key in the
          dictionary is the name of an environment variable, with the
          corresponding value being a string representing the desired
          value. NOTE: Values other than strings will be ignored.

Anyway, this fix is better than relying on launchd to set the path for RUNNER regardless of what's wrong with launchd.  Thanks for revising from the earlier fixes!
Attachment #8544718 - Flags: review?(dustin) → review+
Comment on attachment 8544718 [details] [diff] [review]
bug1094293-5.patch

and merged to production
Attachment #8544718 - Flags: checked-in+
Depends on: 1118902
Depends on: 1119412
So there are 10.10 jobs running on cedar but they aren't showing up on treeherder yet due to bug 1119412.  The jobs yesterday were all orange because cedar was in a bad state due to bug 1118902.  Today, I manually started some new mac jobs on cedar via the buildapi, I'll look at the results once they are available.
Depends on: 1119955
I ran some more jobs again today and you can see the results here
https://treeherder.mozilla.org/#/jobs?repo=cedar&revision=7c03738e7a95&filter-searchStr=10.10

They are taking a while because only two macs are servicing them.
Depends on: 1125853
I think this can be closed. We have three 10.10 slaves running on try, and work to green up the test results is going well. I'll open other bugs when we need to enable a larger pool of machines for other branches.
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Component: Platform Support → Buildduty
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard