Closed Bug 978928 Opened 10 years ago Closed 9 years ago

Reconfigs should be automatic, and scheduled via a cron job

Categories

(Release Engineering :: General, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: pmoore, Assigned: coop)

Attachments

(7 files, 11 obsolete files)

5.57 KB, text/x-log
27.54 KB, patch (coop: review+, pmoore: checked-in+)
1.22 KB, patch (massimo: review+, pmoore: checked-in+)
1.39 KB, patch (jlund: review+)
9.75 KB, patch (jlund: review+)
4.40 KB, patch (Callek: review+)
4.93 KB, patch (kmoir: review+)
Coop has an end-to-end script for running a reconfig. We should set up a cron to take care of this. The only part that was missing was embedding the extra wiki text in https://wiki.mozilla.org/ReleaseEngineering/Maintenance and publishing to the wiki, so I'll attach a patch for this.
Attached file bug978928_tools.patch (obsolete) (deleted) —
This patch is for the "tools" repo.

This is just the part to publish the wiki changes - Coop has a script for the rest. =)
Attachment #8384822 - Flags: review?(coop)
The content of attachment 8384820 [details] has been deleted
An automatic reconfig may require some additional decision-tree thinking about when to reconfig and when not to. That logic will need some analysis, so in the meantime a push-button solution would be great as a first step. It would be useful even with a fully automated solution, since we may always wish to reconfig on demand rather than wait for the next automatic reconfig. A push-button solution also sidesteps the question of when to reconfig, which can be built later if required.

I've even designed a button!

http://dabuttonfactory.com/b.png?t=Reconfig%20now&f=Komika-Bold&ts=50&tc=ffffff&it=png&c=round&bgt=unicolored&bgc=00d100&bs=20&bc=00d100&shs=5&shc=222&sho=se&hp=20&vp=11
Comment on attachment 8384822 [details] [diff] [review]
Corrected version of previous patch

Thanks for this, Pete.

I'll run some tests with this later today as I try to rejigger my existing scripts (https://hg.mozilla.org/build/braindump/file/83472f03f584/hg-related) into isolated, task-based chunks.
Attachment #8384822 - Flags: review?(coop) → review+
There is certainly room for tweaking if required. For example, we don't handle the case where the content does not contain a previous entry, we don't cache the session for future iterations (each new iteration gets a new sessionid), and we don't log out (which may hold on to server resources). We could also provide a help option (-h), and there is no error handling if a particular step fails. It also performs no validation of the wiki text it adds, and it will cause the wiki page to grow continually, since content is never removed (maybe we could limit the page to a given number of reconfigs).

In any case, it is a fully working first version, and if we're happy with it, we can invest some more effort to add these extra things. Just wanted you to have a working version to play with - especially if we will be migrating to a different wiki soon-ish and need to rewrite anyway. =)
Pete: here are the scripts that previously lived in braindump (https://hg.mozilla.org/build/braindump/file/83472f03f584/hg-related). None of them have been reviewed before, so queuing them all for review as they move into tools.

You've seen these scripts in action before, but I have reworked the way they interact. 

First, I broke the merge logic out into its own script (merge_to_production.sh). 

Next, I made the end_to_end_reconfig.sh script use manage_masters.py directly as the default, although tmux is still available as an option. 

Lastly, I renamed your wiki update script to update_maintenance_wiki.sh and made it source a credentials file to get what it needs. I'd still love to consume the curl output that we don't care about, but that can be a future enhancement.
Attachment #8384822 - Attachment is obsolete: true
Attachment #8385484 - Flags: review?(pmoore)
Some slight changes from the previous patch based on errors I hit during a trial run today:

* don't checkout tools again in end_to_end_reconfig.sh. We're already in tools/ at that point, and even (theoretically) in the correct dir.

* in case we're not in the correct dir, check for manage_masters.py

* the check for the preview files to see whether we needed to touch the reconfig_needed flag was broken. Now using |-s filename|

The wiki update worked flawlessly BTW, although I'd still like to hide those headers I don't care about from the output.
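
The `-s` check mentioned in the list above can be sketched roughly as follows. This is only an illustration, not the code from the patch: the function name `flag_if_changes` and its `work_dir` argument are made up here, and it assumes the preview files follow the `*_preview_changes.txt` naming that shows up in the directory listing later in this bug. The point is that `-s` is true only for a file that exists and is non-empty, whereas `-e` is also true for an empty preview file.

```bash
#!/bin/bash
# Hypothetical helper: touch reconfig_needed.flg only if at least one
# preview file actually lists changes (i.e. is non-empty).
flag_if_changes() {
    local work_dir="$1" f
    for f in "${work_dir}"/*_preview_changes.txt; do
        # -s: file exists AND has a size greater than zero.
        # If the glob matched nothing, f is the literal pattern and -s is false.
        if [ -s "${f}" ]; then
            touch "${work_dir}/reconfig_needed.flg"
            return 0
        fi
    done
    return 1
}
```

Called as `flag_if_changes /tmp/reconfig`, it returns non-zero and leaves no flag file behind when every preview file is empty.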
Attachment #8385484 - Attachment is obsolete: true
Attachment #8385484 - Flags: review?(pmoore)
Attachment #8385639 - Flags: review?(pmoore)
Comment on attachment 8385639 [details] [diff] [review]
[tools] Move reconfig scripts from braindump, v2

Review of attachment 8385639 [details] [diff] [review]:
-----------------------------------------------------------------

None of these comments are blocking - just suggestions. You don't need to comply with any of them - all up to you.

Mostly they are technicalities, e.g. cd'ing into the script directory rather than assuming you are already in the directory that contains the script. Even this is not so important if the script is consistently called from the directory it lives in, but it is safer to have. Likewise not assuming /bin/sh -> bash, although in reality it probably always is.

::: buildfarm/maintenance/end_to_end_reconfig.sh
@@ +1,2 @@
> +#!/bin/bash
> +set -ex

do we want debug output on? by the way, you can also use:
#!/bin/bash -ex
and have this in one line

@@ +6,5 @@
> +usage() {
> +echo "Usage: $0 [-t]"
> +echo "    -t: use tmux for reconfig"
> +echo ""
> +}

i'd indent the contents of the function

@@ +19,5 @@
> +	    exit 1
> +	    ;;
> +    esac
> +done
> +

to make sure your working directory is the directory of this script, you can do:
cd "$(dirname "${0}")"

@@ +21,5 @@
> +    esac
> +done
> +
> +rm -f reconfig_needed.flg
> +sh merge_to_production.sh

typically sh is a symbolic link to the shell of choice, which is often bash, but not always - maybe better to explicitly execute with
./merge_to_production.sh
since this will respect the declaration you already have (#!/bin/bash) in the first line of merge_to_production.sh

@@ +25,5 @@
> +sh merge_to_production.sh
> +
> +if [ -e reconfig_needed.flg ]; then
> +    if [ "${USE_TMUX}" == "1" ]; then
> +	sh ../../../reconfig_tmux.sh -f

again you can drop the "sh" so that there is no assumption that e.g. /bin/sh -> bash

@@ +30,5 @@
> +    else
> +	if [ ! -e manage_masters.py ]; then
> +	    echo "Couldn't find manage_masters.py. Exiting."
> +	    exit 1
> +	fi

i think the -e manage_masters.py test is not needed if we cd "$(dirname "${0}")" at the top of the script?

@@ +33,5 @@
> +	    exit 1
> +	fi
> +	python manage_masters.py -f production-masters.json -j16 -R scheduler -R build -R try -R tests show_revisions update checkconfig reconfig
> +    fi
> +    popd

why do we need to popd?

@@ +34,5 @@
> +	fi
> +	python manage_masters.py -f production-masters.json -j16 -R scheduler -R build -R try -R tests show_revisions update checkconfig reconfig
> +    fi
> +    popd
> +    rm -rf tools

isn't this rm -rf tools handled by merge_to_production.sh?

::: buildfarm/maintenance/merge_to_production.sh
@@ +1,2 @@
> +#!/bin/bash
> +set -ex

could combine line 1+2 into: #!/bin/bash -ex

@@ +25,5 @@
> +echo -n "Merge started..."
> +date
> +
> +for d in mozharness buildbot-configs buildbotcustom; do
> +  rm -rf ${d}

which directory should this be run in? there are no "cd" statements before this line

if we do it in the current directory, are we not potentially dirtying the tools checkout (in case it crashes half-way through, etc?)

could we instead have a separate file system area (e.g. /var/reconfig) where we do the work?

@@ +28,5 @@
> +for d in mozharness buildbot-configs buildbotcustom; do
> +  rm -rf ${d}
> +  hg clone ssh://hg.mozilla.org/build/${d}
> +  pushd ${d}
> +  hg pull && hg up -r default

why do we need to pull after the clone?
why do we hg up -r default, when next we hg up to production?

@@ +65,5 @@
> +    rm -f reconfig_update_for_maintenance.wiki
> +    echo '|-' > reconfig_update_for_maintenance.wiki
> +    echo '| in production' >> reconfig_update_for_maintenance.wiki
> +    echo "| `TZ=America/Los_Angeles date +"%Y-%m-%d %H:%M PT"`" >> reconfig_update_for_maintenance.wiki
> +    echo '|' >> reconfig_update_for_maintenance.wiki

this section might be cleaner to read like this:

{
    echo '|-'
    echo '| in production'
    echo "| `TZ=America/Los_Angeles date +"%Y-%m-%d %H:%M PT"`"
    echo '|'
} > reconfig_update_for_maintenance.wiki

@@ +66,5 @@
> +    echo '|-' > reconfig_update_for_maintenance.wiki
> +    echo '| in production' >> reconfig_update_for_maintenance.wiki
> +    echo "| `TZ=America/Los_Angeles date +"%Y-%m-%d %H:%M PT"`" >> reconfig_update_for_maintenance.wiki
> +    echo '|' >> reconfig_update_for_maintenance.wiki
> +    grep summary *_preview_changes.txt | awk '{sub (/ r=.*$/,"");print substr($0, index($0,$2))}' | sed 's/[Bb]ug \([0-9]*\):* *-* */\* {{bug|\1}} - /' | sort -u >> reconfig_update_for_maintenance.wiki

if [Bb]ug is not included in commit message, do we want to include the line? If yes, it is fine. If not, we can change to sed -n 's/...../..../p'.
should we insist that [Bb]ug appears as the first characters of the match, i.e. 's/^[Bb]ug.......' ?
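
To make the two questions above concrete, here is a rough sketch of the stricter variant: it keeps only summaries that start with a bug number (anchored `^[Bb]ug`) and drops everything else by using `sed -n` with the `p` flag. The function name `summaries_to_wiki_bullets` is hypothetical, and the sketch assumes hg-log-style `summary:` lines on stdin rather than the grep/awk pipeline from the patch.

```bash
#!/bin/bash
# Hypothetical helper: convert "summary:" lines into wiki bullets, keeping
# only commits whose message starts with "Bug NNNN" or "bug NNNN".
summaries_to_wiki_bullets() {
    grep '^summary:' \
        | sed -e 's/^summary: *//' -e 's/ r=.*$//' \
        | sed -n 's/^[Bb]ug \([0-9][0-9]*\):* *-* */* {{bug|\1}} - /p' \
        | sort -u
}
```

With the preview file shown later in this bug, the two ffxbld "Update release config" commits would be left out of the wiki entry, which may or may not be what we want.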

@@ +67,5 @@
> +    echo '| in production' >> reconfig_update_for_maintenance.wiki
> +    echo "| `TZ=America/Los_Angeles date +"%Y-%m-%d %H:%M PT"`" >> reconfig_update_for_maintenance.wiki
> +    echo '|' >> reconfig_update_for_maintenance.wiki
> +    grep summary *_preview_changes.txt | awk '{sub (/ r=.*$/,"");print substr($0, index($0,$2))}' | sed 's/[Bb]ug \([0-9]*\):* *-* */\* {{bug|\1}} - /' | sort -u >> reconfig_update_for_maintenance.wiki
> +    sh update_maintenance_wiki.sh reconfig_update_for_maintenance.wiki

we can drop sh - e.g.
./update_maintenance_wiki.sh reconfig_update_for_maintenance.wiki

::: buildfarm/maintenance/reconfig_tmux.sh
@@ +1,2 @@
> +#!/bin/bash
> +#set -ex

cd "$(dirname "${0}")" ?

@@ +10,5 @@
> +echo "   -f: full reconfig (show_revisions update checkconfig reconfig)"
> +echo ""
> +echo "   You must specify at least one option. You can also specify multiple options."
> +echo "   e.g. $0 -s -r"
> +}

could indent function

::: buildfarm/maintenance/update_maintenance_wiki.sh
@@ +59,5 @@
> +login_token="$(echo "${json}" | sed -e 's/.*"token":"//' -e 's/".*//')"
> +# login again, using login token received (see https://www.mediawiki.org/wiki/API:Login)
> +curl -s -b "${cookie_jar}" -d action=login -d lgname="${WIKI_USERNAME}" -d lgpassword="${WIKI_PASSWORD}" -d lgtoken="${login_token}" 'https://wiki.mozilla.org/api.php' 2>&1 > login.output
> +# get an edit token, remembering to pass previous cookies (see https://www.mediawiki.org/wiki/API:Edit)
> +edit_token="$(curl -b "${cookie_jar}" -s -v -d action=query -d prop=info -d intoken=edit -d titles=ReleaseEngineering/Maintenance 'https://wiki.mozilla.org/api.php' | sed -n 's/.*edittoken="//p' | sed -n 's/".*//p')"

can remove -v from curl command to lose headers

@@ +64,5 @@
> +# now post new content...
> +curl -s -b "${cookie_jar}" -H 'Content-Type:application/x-www-form-urlencoded' -d action=edit -d title='ReleaseEngineering/Maintenance' -d 'summary=reconfig' -d "text=$(cat "${new_content}")" --data-urlencode token="${edit_token}" 'https://wiki.mozilla.org/api.php' 2>&1 > update.output
> +grep -q Success update.output
> +RETVAL=$?
> +echo

we could log out here, to spare resources, e.g.
curl -s -b "${cookie_jar}" -d action=logout 'https://wiki.mozilla.org/api.php'
Attachment #8385639 - Flags: review?(pmoore) → review+
ahhh - also hadn't spotted, can you:

chmod a+x update_maintenance_wiki.sh merge_to_production.sh end_to_end_reconfig.sh

to be consistent with other scripts in that dir, and for the #!/bin/bash to work and to be able to run script directly, rather than via using "sh" or "bash" when calling script.
also, we need this to be run in a virtualenv, right? i didn't see reference to that in the scripts... how is that set up?
Addresses Pete's suggestions from comment #10 and comment #11.

(In reply to Pete Moore [:pete][:pmoore] from comment #12)
> also, we need this to be run in a virtualenv, right? i didn't see reference
> to that in the scripts... how is that set up?

The virtualenv just needs fabric. There are other issues we need to tackle before this can be fully automated though:

* need a central place to run this. cruncher perhaps, or dev-master1. Needs access via ssh to all the masters.

* need a generic account to edit the wiki via the API, potentially one with different privileges/throttling to prevent abuse or scripts run amok.

* one missing piece is updating the relevant bugs with a "merged to production" comment. It should be easy to have a separate script parse the generated wiki update file and update the bugs using an existing set of releng automated bugzilla credentials. Both autoland and slaveapi have credentials we could use.

This set of scripts works for me locally using personal credentials and keys for everything, but we'll need to address the above points if we ever want a push-button solution.

I won't have a chance to work on this more before my vacation, so feel free to pick it up and run with it if you want.
Attachment #8385639 - Attachment is obsolete: true
Attachment #8388545 - Flags: review?(pmoore)
FYI I took this and ran it on cruncher... 

I never noticed that it missed doing the wiki update amid all the other reconf scrollback, but here's some relevant stuff (I used reconf_tmp because I had no perms on the /tmp/* directory that the script creates)

So `something` was surely up here.

[jwood@cruncher.srv.releng.scl3 tmp]$ ls reconf_tmp/
buildbot-configs                      buildbotcustom  reconfig_needed.flg
buildbot-configs_preview_changes.txt  mozharness      reconfig_update_for_maintenance.wiki
[jwood@cruncher.srv.releng.scl3 tmp]$ cat reconf_tmp/reconfig_update_for_maintenance.wiki
|-
| in production
| 2014-03-11 09:28 PT
|
[jwood@cruncher.srv.releng.scl3 tmp]$ cat reconf_tmp/buildbot-configs_preview_changes.txt
Merging from default

changeset:   9943:3095984721a5
parent:      9939:06bf0e3ce554
user:        ffxbld
date:        Mon Mar 10 14:05:22 2014 -0700
summary:     Update release config for Fennec-28.0b10-build1

changeset:   9946:5f48a25b6b95
parent:      9943:3095984721a5
user:        ffxbld
date:        Mon Mar 10 16:56:20 2014 -0700
summary:     Update release config for Firefox-28.0-build1

changeset:   9947:081cce52261d
user:        Armen Zambrano Gasparnian <armenzg@mozilla.com>
date:        Tue Mar 11 10:52:23 2014 -0400
summary:     Bug 837017 - Backout due to debug Jetpack's not being scheduled. r=backout

changeset:   9948:89ff838d3974
user:        Armen Zambrano Gasparnian <armenzg@mozilla.com>
date:        Tue Mar 11 11:35:47 2014 -0400
summary:     Bug 837017 - Add Linux/Linux64 debug browser-chrome for Elm. r=Callek

changeset:   9949:4a653f3ccabc
tag:         tip
user:        Chris AtLee <catlee@mozilla.com>
date:        Tue Mar 11 12:20:52 2014 -0400
summary:     Bug 979450: Switch on dep builds of emulator-kk on try, rather than periodic. r=Callek

[jwood@cruncher.srv.releng.scl3 tmp]$
Hey Callek,

Did you create the file with your user credentials for the wiki?
Do you have the full log file?

Pete
Flags: needinfo?(bugspam.Callek)
(In reply to Pete Moore [:pete][:pmoore] from comment #15)
> Hey Callek,
> 
> Did you create the file with your user credentials for the wiki?
> Do you have the full log file?

Sadly I don't have a log file for this one, but I did store my creds (though it may have been wrong path since I didn't see info to say where)... I'll give you a log next reconf attempt
It should be the case that if it could not find the file, it would report it in the logs. It might make sense to move that test to the start of the whole process, so the problem can be caught early.

It could even validate the login upfront too, with a test login and logout.

I guess a wiki authentication failure should not block the script running though, since not being able to publish to wiki should probably not be a reason to block the reconfig. It should probably output a warning at the end that it was unable to publish to the wiki, with the specific reason (no credentials / wrong credentials / publish failed / could not reach wiki server / could not download current wiki page content / .....).
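
A rough sketch of that "warn, don't block" behaviour is below. All of the names (`warn`, `publish_wiki_page`, `try_wiki_publish`, `report_warnings`) are invented for illustration, and `publish_wiki_page` is just a stand-in that assumes update_maintenance_wiki.sh exits non-zero on failure. Only two of the failure reasons listed above are distinguished here.

```bash
#!/bin/bash
# Sketch only: treat the wiki update as best-effort and report problems at the end.
WARNINGS=()

warn() {
    WARNINGS+=("$1")
}

# Stand-in for the real publishing step; assumed to exit non-zero on failure.
publish_wiki_page() {
    ./update_maintenance_wiki.sh "$1"
}

# Always returns 0: a wiki problem must not abort the reconfig.
try_wiki_publish() {
    local creds_file="$1" content_file="$2"
    if [ ! -r "${creds_file}" ]; then
        warn "wiki not updated: credentials file '${creds_file}' not found (content kept in '${content_file}')"
    elif ! publish_wiki_page "${content_file}"; then
        warn "wiki not updated: publish failed (content kept in '${content_file}')"
    fi
    return 0
}

# Print every collected warning to stderr; call this as the very last step.
report_warnings() {
    local w
    if [ "${#WARNINGS[@]}" -gt 0 ]; then
        for w in "${WARNINGS[@]}"; do
            echo "WARNING: ${w}" >&2
        done
    fi
}
```

Each warning names the file holding the generated wiki text, so it can be pasted into the wiki by hand if publishing did not work.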
(In reply to Pete Moore [:pete][:pmoore] from comment #17)
> It should be the case that if it could not find the file, it would report it
> in the logs. It might make sense to move that test to the start of the whole
> process, so the problem can be caught early.
> 
> It could even validate the login upfront too, with a test login and logout.
> 
> I guess a wiki authentication failure should not block the script running
> though, since not being able to publish to wiki should probably not be a
> reason to block the reconfig. It should probably output a warning at the end
> that it was unable to publish to the wiki, with the specific reason (no
> credentials / wrong credentials / publish failed / could not reach wiki
> server / could not download current wiki page content / .....).

well IFF it does error out and not block reconfig, would be nice if it at least pointed me at the file I can copy/paste into wiki manually, (which this did not) and afaict it didn't even generate the file(s) properly
Flags: needinfo?(bugspam.Callek)
(In reply to Justin Wood (:Callek) from comment #18)

> well IFF it does error out and not block reconfig, would be nice if it at
> least pointed me at the file I can copy/paste into wiki manually, (which
> this did not)

Agreed.

> and afaict it didn't even generate the file(s) properly

Please provide either logs, or badly created files, or provide more details about why you believe it didn't generate the file(s) properly. The script generates temporary files using mktemp command - these will not be in the working directory where the script is, or where you ran the command from, more likely under /tmp. In any case they are deleted after publishing, so you would not see them on your filesystem unless you explicitly commented out the lines that delete them, and took care to take a copy of the files from e.g. /tmp after it ran. Since you did not keep log files, we'll only know the next time you run it, anything else would just be speculation.
Attached file reconfig.log
I ran end_to_end.
Here's the log.
The script can use -R all instead of listing all roles.

I would also split the update step from the checkconfig and reconfig since for reconfig/checkconfig we can use a much higher -j.

My run failed because I had already merged manually myself.
IIUC reconfig_tmux.sh assumes that manage_masters.py is in the same location but it actually lives in the tools repo.
Same issue as in comment 22:
+ echo -n 'Merge finished...'
Merge finished...+ date
Fri Mar 14 10:56:04 EDT 2014
+ '[' -e /tmp/reconfig/reconfig_needed.flg ']'
+ '[' 0 == 1 ']'
+ python manage_masters.py -f production-masters.json -j16 -R scheduler -R build -R try -R tests show_revisions update checkconfig reconfig
python: can't open file 'manage_masters.py': [Errno 2] No such file or directory
The wiki got updated but did not add any bug information:
|-
| in production
| 2014-03-14 07:55 PT
|
|-
Agreed - if nothing gets merged, and the reconfig doesn't take place, then it shouldn't update the wiki. Good catch.
went to do this today for you and get a log, didn't actually get a useful one for you for the following reasons:

# First attempt failed because I couldn't write to the /tmp dir on cruncher

# second attempt failed since wiki_credentials.sh was in my cwd rather than inside tools/*
** I highly suggest you check cwd before "alongside the .sample" since I never like to keep a password file under a repository, to prevent me accidentally putting said file anywhere

# Third attempt failed because I didn't have fabric activated, so no reconfig was attempted (but it did write a blank entry to the maint page)

# Fourth attempt failed because the script didn't think a reconf was needed because hg was already pushed.



Recommendations:

* Use `mktemp` for /tmp/reconf_tmp by default, and remove it when the script exits cleanly (with a trap that echoes where all the data is in case there is an error and we want to look)

* Do the wiki update AFTER the reconfig

* Verify wiki credentials in place, and login works BEFORE the reconfig

* Never push to repos until the creds are verified

* Never push to repos unless we're in a python environ with fabric installed (e.g. run `pip freeze | grep fabric`)

* If the reconfig fails, print instructions on how to re-kick in such a way that the already-pushed repo merge will still proceed to write to wiki, and/or how to rekick the merge itself.

* automatic log of actions written to $cwd for the run. (preferably ensuring that no passwords are written out somehow)
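
The first recommendation above (a `mktemp` directory that is removed on a clean exit but kept, with its path echoed, on failure) could look something like this. It is a sketch under assumptions, not part of any attached patch: `run_in_scratch_dir` is a made-up name, and the `reconfig.XXXXXX` template is arbitrary.

```bash
#!/bin/bash
# Sketch: run a command inside a fresh mktemp directory. The directory is
# removed after a clean exit and kept (with its path printed) after a failure.
run_in_scratch_dir() {
    (
        scratch="$(mktemp -d "${TMPDIR:-/tmp}/reconfig.XXXXXX")" || exit 1
        # The EXIT trap fires when this subshell ends, whatever the reason.
        trap 'status=$?
              if [ "${status}" -eq 0 ]; then
                  rm -rf "${scratch}"
              else
                  echo "Failed with exit code ${status}; files kept in ${scratch}" >&2
              fi' EXIT
        cd "${scratch}" || exit 1
        "$@"
        exit "$?"
    )
}
```

For example, `run_in_scratch_dir ./merge_to_production.sh` would leave the clones and preview files behind only when the merge fails.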
Thanks Callek, I really appreciate the feedback. Will look into this all next week.
Assigning it to coop for now.
Assignee: nobody → coop
mgerva: Oh i just forgot john hopkins was asking where do you keep your reconfig scripts...

There is a bug with an attachment from 
coop: https://bugzilla.mozilla.org/show_bug.cgi?id=978928
He needs to
A) activate virtual env with fabric installed
B) copy the sample wiki credentials file to remove the .sample from filename
C) make sure he hasn't merged already! (Script does this)
D) put  *his* wiki credentials in that config file
E) run the reconfig script ideally *without* tmux enabled, and redirect standard out *and* standard error to a file (or use tee utility) so we don't lose output useful for debugging problems later
F) for extra debug set -xv in all bash scripts
G) report any problems in the bug
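
For step E, one way to capture both standard out and standard error with tee while still seeing the output live is sketched below. `run_logged` is a hypothetical wrapper, not something in the tools repo. It returns the exit code of the wrapped command rather than that of tee.

```bash
#!/bin/bash
# Sketch: run a command, copy stdout and stderr to a log file via tee,
# and preserve the command's own exit status.
run_logged() {
    local log_file="$1"
    shift
    "$@" 2>&1 | tee "${log_file}"
    # PIPESTATUS[0] is the status of the command, not of tee.
    return "${PIPESTATUS[0]}"
}
```

Usage would be along the lines of `run_logged reconfig.log ./end_to_end_reconfig.sh`, after which reconfig.log can be attached to the bug.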
Coop: so I've had a chance to use your scripts a couple of times now, and I made a couple of tweaks during buildduty, so I'll try to get a review to you later today when some of the buildduty stuff has died down a bit...
Based on above feedback, and other enhancement ideas, I've started work on a new version, that has several enhancements:

   * If there is a merge conflict, the script will exit, you can resolve conflict, and rerun script - it will continue where it left off
   * There are multiple options to enable/disable parts of the process
   * The logging is much friendlier now, so you can always see what it is doing
   * The wiki credentials config file can live outside of the repo, to avoid accidentally committing a file with a password
   * The validation is much better - so all validation is done upfront, so any problems are found before the reconfig occurs, and any problems are identified with detailed explanations in the logging so a user can easily fix problems.

For work in progress, see my changes as they occur here:
https://github.com/petemoore/build-tools/compare/bug978928

Please note this is *work in progress* and has not been tested yet.

=========================================
Here is the help text for the new version
=========================================

This script can be used to reconfig interactively, or non-interactively. It will merge
buildbotcustom, buildbot-configs, mozharness from default to production(-0.8).
It will then reconfig, and afterwards if all was successful, it will also update the
wiki page https://wiki.mozilla.org/ReleaseEngineering/Maintenance.

Usage: ./end_to_end_reconfig.sh -h
Usage: ./end_to_end_reconfig.sh [-d] [-f] [-m] [-n] [-r RECONFIG_DIR] [-t] [-w WIKI_CREDENTIALS_FILE]

    -d:                        Dry run; will not make changes.
    -f:                        Force reconfig, even if no changes merged.
    -h:                        Display help.
    -m:                        No merging of default -> production(-0.8) of hg branches.
    -n:                        No wiki update.
    -r RECONFIG_DIR:           Use directory RECONFIG_DIR for storing temporary files
                               (default is /tmp/reconfig). This directory, and any
                               necessary parent directories will be created if required.
    -t:                        Use TMUX for reconfig (default is *not* to use TMUX).
    -w WIKI_CREDENTIALS_FILE:  Source WIKI_USERNAME and WIKI_PASSWORD env vars from file
                               WIKI_CREDENTIALS_FILE (default is ~/.wikiwriter/config).
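
As a rough illustration of how the options in the help text above might be parsed with getopts, here is a minimal sketch. It is not the code from the work-in-progress branch: `parse_args`, `DRY_RUN` and `SHOW_HELP` are invented names, and the exit code 64 for usage errors is an arbitrary choice. The defaults mirror the ones stated in the help text.

```bash
#!/bin/bash
# Sketch of option parsing for the usage text above.
parse_args() {
    DRY_RUN=0; FORCE_RECONFIG=0; MERGE_TO_PRODUCTION=1
    UPDATE_WIKI=1; USE_TMUX=0; SHOW_HELP=0
    RECONFIG_DIR="/tmp/reconfig"
    WIKI_CREDENTIALS_FILE="${HOME}/.wikiwriter/config"
    local opt OPTIND=1
    # Every flag must be listed in the optstring; a letter that is left out
    # never reaches its case branch. A trailing ':' means "takes an argument".
    while getopts ":dfhmnr:tw:" opt; do
        case "${opt}" in
            d) DRY_RUN=1 ;;
            f) FORCE_RECONFIG=1 ;;
            h) SHOW_HELP=1 ;;
            m) MERGE_TO_PRODUCTION=0 ;;
            n) UPDATE_WIKI=0 ;;
            r) RECONFIG_DIR="${OPTARG}" ;;
            t) USE_TMUX=1 ;;
            w) WIKI_CREDENTIALS_FILE="${OPTARG}" ;;
            :) echo "Option -${OPTARG} requires an argument" >&2; return 64 ;;
            \?) echo "Unknown option -${OPTARG}" >&2; return 64 ;;
        esac
    done
    return 0
}
```

With this, `parse_args -f -m` leaves FORCE_RECONFIG=1 and MERGE_TO_PRODUCTION=0, i.e. reconfig without merging.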
Comment on attachment 8388545 [details] [diff] [review]
Move reconfig scripts from braindump, v3

Review of attachment 8388545 [details] [diff] [review]:
-----------------------------------------------------------------

Hi Coop,

Based on the feedback in the comments by all the folks that used it, there were a few problems that needed to be addressed.

I've rewritten somewhat to address the various points above, and pushed to here: https://github.com/petemoore/build-tools/compare/bug978928.

Will attach a patch for this shortly. I suggest various people use it a few times before we go live, I think it would be good to trial it before committing.

Pete
Attachment #8388545 - Flags: review?(pmoore) → review-
Looking forward to hearing how you get on with it! :)
Assignee: coop → pmoore
Attachment #8388545 - Attachment is obsolete: true
Status: NEW → ASSIGNED
Attachment #8407014 - Flags: review?(coop)
Attachment #8407014 - Flags: feedback?(bugspam.Callek)
Attachment #8407014 - Flags: feedback?(armenzg)
Minor fix for last patch ... whoops :)
Attachment #8407027 - Flags: review?(coop)
Attachment #8407027 - Flags: feedback?(bugspam.Callek)
Attachment #8407014 - Attachment is obsolete: true
Attachment #8407014 - Flags: review?(coop)
Attachment #8407014 - Flags: feedback?(bugspam.Callek)
Attachment #8407014 - Flags: feedback?(armenzg)
FYI: I landed a patch for buildbot-wrangler (the thing that manage_masters.py calls on the buildbot master to do the reconfig) which lets it detect most failed reconfigs. It's not directly relevant to your work here, but I thought you should know. That change is from bug 996585.
Comment on attachment 8407027 [details] [diff] [review]
Patch for https://hg.mozilla.org/build/tools/ repo

Review of attachment 8407027 [details] [diff] [review]:
-----------------------------------------------------------------

Looks like the only outstanding piece is updating relevant bugs in bugzilla based on the output of update_maintenance_wiki.sh. That can be a follow-up piece though.

::: buildfarm/maintenance/end_to_end_reconfig.sh
@@ +2,5 @@
> +
> +START_TIME="$(date +%s)"
> +
> +# Explicitly unset any pre-existing environment variables to avoid variable collision
> +unset PREPARE_ONLY FORCE_RECONFIG MERGE_TO_PRODUCTION UPDATE_WIKI RECONFIG_DIR USE_TMUX WIKI_CREDENTIALS_FILE WIKI_USERNAME WIKI_PASSWORD

I like being able to set these via the ENV, but I'll bow to popular opinion here.

@@ +15,5 @@
> +    echo "Usage: $0 [-f] [-m] [-n] [-p] [-r RECONFIG_DIR] [-t] [-w WIKI_CREDENTIALS_FILE]"
> +    echo
> +    echo "    -f:                        Force reconfig, even if no changes merged."
> +    echo "    -h:                        Display help."
> +    echo "    -m:                        No merging of default -> production(-0.8) of hg branches."

How does one skip the merge and just run the reconfig part? Does that amount to |-f -m|?

@@ +154,5 @@
> +    if [ ! -e "${ABS_WIKI_CREDENTIALS_FILE}" ]; then
> +        echo "  * Wiki credentials file '${ABS_WIKI_CREDENTIALS_FILE}' not found; creating..." >&2
> +        {
> +            echo 'export WIKI_USERNAME="naughtymonkey"'
> +            echo 'export WIKI_PASSWORD="nobananas"'

LOL

@@ +190,5 @@
> +        echo "  * Logging to: '${RECONFIG_DIR}/virtualenv-fabric-installation.log'..."
> +        virtualenv "${RECONFIG_DIR}/fabric-virtual-env" >"${RECONFIG_DIR}/virtualenv-fabric-installation.log" 2>&1
> +        source "${RECONFIG_DIR}/fabric-virtual-env/bin/activate"
> +        echo "  * Installing fabric under '${RECONFIG_DIR}/fabric-virtual-env'..."
> +        pip install fabric >"${RECONFIG_DIR}/virtualenv-fabric-installation.log" 2>&1

I think I'd rather fail out here rather than modifying the users env on-the-fly.
Attachment #8407027 - Flags: review?(coop) → review+
(In reply to Chris Cooper [:coop] from comment #36)
> Comment on attachment 8407027 [details] [diff] [review]
> Patch for https://hg.mozilla.org/build/tools/ repo
> 
> Review of attachment 8407027 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> Looks like the only outstanding piece is updating relevant bugs in bugzilla
> based on the output of update_maintenance_wiki.sh. That can be a follow-up
> piece though.
> 

Good idea. Have raised bug 1004617 for this.

> ::: buildfarm/maintenance/end_to_end_reconfig.sh
> @@ +2,5 @@
> > +
> > +START_TIME="$(date +%s)"
> > +
> > +# Explicitly unset any pre-existing environment variables to avoid variable collision
> > +unset PREPARE_ONLY FORCE_RECONFIG MERGE_TO_PRODUCTION UPDATE_WIKI RECONFIG_DIR USE_TMUX WIKI_CREDENTIALS_FILE WIKI_USERNAME WIKI_PASSWORD
> 
> I like being able to set these via the ENV, but I'll bow to popular opinion
> here.

I prefer, if possible, having only one mechanism to set the variables, so that we don't have to check different mechanisms, add extra logging, and include extra docs to explain the two separate systems for setting config (e.g. environment variable vs command line option). It could also introduce contention if both are set, and it might not be intuitive which one takes precedence. In short, since I think any script that can set environment variables can also set command line options, I'd prefer to leave it with only supporting command line options, for now. If others want it too, I can change it.

> @@ +15,5 @@
> > +    echo "Usage: $0 [-f] [-m] [-n] [-p] [-r RECONFIG_DIR] [-t] [-w WIKI_CREDENTIALS_FILE]"
> > +    echo
> > +    echo "    -f:                        Force reconfig, even if no changes merged."
> > +    echo "    -h:                        Display help."
> > +    echo "    -m:                        No merging of default -> production(-0.8) of hg branches."
> 
> How does one skip the merge and just run the reconfig part? Does that amount
> to |-f -m|?

Exactly.

> 
> @@ +154,5 @@
> > +    if [ ! -e "${ABS_WIKI_CREDENTIALS_FILE}" ]; then
> > +        echo "  * Wiki credentials file '${ABS_WIKI_CREDENTIALS_FILE}' not found; creating..." >&2
> > +        {
> > +            echo 'export WIKI_USERNAME="naughtymonkey"'
> > +            echo 'export WIKI_PASSWORD="nobananas"'
> 
> LOL

;)

> 
> @@ +190,5 @@
> > +        echo "  * Logging to: '${RECONFIG_DIR}/virtualenv-fabric-installation.log'..."
> > +        virtualenv "${RECONFIG_DIR}/fabric-virtual-env" >"${RECONFIG_DIR}/virtualenv-fabric-installation.log" 2>&1
> > +        source "${RECONFIG_DIR}/fabric-virtual-env/bin/activate"
> > +        echo "  * Installing fabric under '${RECONFIG_DIR}/fabric-virtual-env'..."
> > +        pip install fabric >"${RECONFIG_DIR}/virtualenv-fabric-installation.log" 2>&1
> 
> I think I'd rather fail out here rather than modifying the users env
> on-the-fly.

So this won't modify a user's existing environment: if they do not have fabric in their environment, it will create a new virtualenv so that their python environment is not affected. If this directory already exists ("virtualenv-fabric-installation") then it won't modify it. It should also be reasonably unlikely that they create a virtualenv with exactly this name and not use it to install fabric. This uses the same principle as mozharness: a user who has not set up their environment at all gets a virtualenv created for them on first run, so they do not need to be concerned with setting up an environment for running the tool. The tool takes care of this for them, and does it in a separate directory that did not previously exist, so their regular environment is not affected.

I've attached an updated patch to cover the two following issues:
1) There was a mistake in the second manage_masters.py line:

<             ./manage_masters.py -f production-masters.json -j32 -R scheduler -R build -R try -R checkconfig reconfig
---
>             ./manage_masters.py -f production-masters.json -j32 -R scheduler -R build -R try -R tests checkconfig reconfig

and

2) urlencode the text being published (e.g. to escape & in literal text)

<     curl -s -b "${cookie_jar}" -H 'Content-Type:application/x-www-form-urlencoded' -d action=edit -d title='ReleaseEngineering/Maintenance' -d 'summary=reconfig' -d "text=$(cat "${new_content}")" --data-urlencode token="${edit_token}" 'https://wiki.mozilla.org/api.php' 2>&1 > "${publish_log}"
---
>     curl -s -b "${cookie_jar}" -H 'Content-Type:application/x-www-form-urlencoded' -d action=edit -d title='ReleaseEngineering/Maintenance' -d 'summary=reconfig' --data-urlencode "text=$(cat "${new_content}")" --data-urlencode token="${edit_token}" 'https://wiki.mozilla.org/api.php' 2>&1 > "${publish_log}"
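
For anyone wondering why the second change matters, here is a minimal Python sketch (not part of the patch; the sample wiki text is made up). It shows how a literal `&` cuts a form field short unless the value is percent-encoded, which is roughly the difference between `-d` and `--data-urlencode` above.

```python
from urllib.parse import parse_qs, quote_plus

# Made-up wiki text containing a literal '&'.
text = "Merged buildbot-configs & buildbotcustom"

# Unencoded, similar to `curl -d "text=..."`: the '&' ends the field early.
raw_body = "action=edit&text=" + text

# Percent-encoded, similar to `curl --data-urlencode "text=..."`.
encoded_body = "action=edit&text=" + quote_plus(text)

# Parse both bodies the way a form-decoding server would.
raw_text = parse_qs(raw_body)["text"][0]
encoded_text = parse_qs(encoded_body)["text"][0]

print(repr(raw_text))      # truncated at the '&'
print(repr(encoded_text))  # round-trips intact
```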
Attachment #8407027 - Attachment is obsolete: true
Attachment #8407027 - Flags: feedback?(bugspam.Callek)
Attachment #8416025 - Flags: review?(coop)
Comment on attachment 8416025 [details] [diff] [review]
Patch for https://hg.mozilla.org/build/tools/ repo

Review of attachment 8416025 [details] [diff] [review]:
-----------------------------------------------------------------

I don't think anyone is going to use the tmux script. We should drop it.

Everything else looks good.
Attachment #8416025 - Flags: review?(coop) → review+
Thanks Coop. I've dropped TMUX support, as requested. The rest is the same (except that the 'm' was missing from the getopts command, so the -m option would never get activated - now fixed).
Attachment #8416025 - Attachment is obsolete: true
Attachment #8418693 - Flags: review?(coop)
Comment on attachment 8418693 [details] [diff] [review]
Patch for https://hg.mozilla.org/build/tools/ repo

Review of attachment 8418693 [details] [diff] [review]:
-----------------------------------------------------------------

I'll try using this for the rest of the week as buildduty and we can work out any remaining kinks.
Attachment #8418693 - Flags: review?(coop) → review+
Hey Coop,

This is a little embarrassing, but I just spotted that there are a whole bunch of commits I made 3 weeks ago that I never attached to this bug. Here you can see, in my github branch, the work I had done but forgotten to attach:

https://github.com/petemoore/build-tools/commit/4f666d406df372aa0dccac3d86a2438a3cd0fc04

I just spotted that half these changes were missing, so I've attached a new patch which includes all of my other fixes that I made 3 weeks ago, plus the latest changes you asked for (e.g. the patch to remove support for tmux).

Sorry for the noise. :)

Pete
Attachment #8418693 - Attachment is obsolete: true
Attachment #8419348 - Flags: review?(coop)
Attachment #8419348 - Flags: review?(coop) → review+
Attachment #8419348 - Flags: checked-in+
Somewhat related: today we ran into an issue (bug 1009880) where a reconfig was initiated up to three times (based on more than one bbot cset) before any of them finished on certain masters.

This was partly due to human error, but as we move to an automatic/cron method of reconfigs I see a danger of this happening again, e.g. a cron reconfig starting and a human-initiated reconfig being triggered before the cron one finishes.

Maybe we should have some way of ensuring that a reconfig can only start on a master when no other reconfig is already in progress there.

This could be overkill but I think it warrants consideration.
Absolutely - and I agree locking at the master level is the correct place to do it, rather than a "global" lock - since masters can in theory reconfig independently of each other. Will create a separate bug for this.
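
A minimal sketch of what per-master locking could look like, assuming one lock file inside each master's directory. The file name, the master names and the directory layout are all hypothetical here. `os.open` with `O_CREAT | O_EXCL` fails if the file already exists, so checking and creating happen in one step and only one caller wins per master.

```python
import os
import tempfile

# Hypothetical file name; one lock file per master directory.
LOCK_NAME = "reconfig.lock"


def try_lock(master_dir):
    """Return True if this call created the lock, False if it already existed."""
    path = os.path.join(master_dir, LOCK_NAME)
    try:
        # O_CREAT | O_EXCL makes "check and create" a single atomic step.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as handle:
        handle.write(str(os.getpid()))  # record who holds the lock
    return True


def unlock(master_dir):
    """Release the lock for one master."""
    os.remove(os.path.join(master_dir, LOCK_NAME))


def demo():
    """Two made-up masters under a temp dir; returns the lock attempts in order."""
    with tempfile.TemporaryDirectory() as root:
        bm01 = os.path.join(root, "bm01")
        bm02 = os.path.join(root, "bm02")
        os.mkdir(bm01)
        os.mkdir(bm02)
        # Second attempt on bm01 is refused; bm02 is independent of bm01.
        results = [try_lock(bm01), try_lock(bm01), try_lock(bm02)]
        unlock(bm01)
        results.append(try_lock(bm01))  # free again after unlock
        return results


print(demo())
```

Because the lock lives with the master rather than in one global place, a reconfig on one master never blocks another.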
Depends on: 1010126
See also https://bugzilla.mozilla.org/show_bug.cgi?id=1009880#c11 - I couldn't decide where best to comment.
A very minor fix, as I am testing vcs_sync at the moment, and need a new commit to come in on tools repo!

So killing two birds with one stone, and knocking a simple "to-do" off the list here...
Attachment #8429926 - Flags: review?(mgervasini)
Attachment #8429926 - Flags: review?(mgervasini) → review+
Comment on attachment 8429926 [details] [diff] [review]
Create wiki credentials file parent directory if it does not exist

https://hg.mozilla.org/build/tools/rev/031dea929f78
Attachment #8429926 - Flags: checked-in+
Coop, is this a good project to give an intern?

Other features that might be nice:
*) merging puppet repo too (we now have a 'production' branch)
*) forcing a puppet update on buildbot masters, to pick up any puppet changes
*) updating bugzilla tickets of bugs that land
*) providing logs in a central location
*) providing a web interface "big green button" to reconfig on demand too

Let me know if you'd like me to split this up into multiple bugs.
Adding Rail's feedback from https://bugzilla.mozilla.org/show_bug.cgi?id=904176#c0:

It would be great to run reconfigs as buildbot/jenkins jobs for several reasons:

- avoid dependency on the "my laptop" environment
- flaky WiFi
- formalized merge procedure
- can be run by sheriffs in case of emergency

We may need to figure out how to update the maintenance page.
(In reply to Pete Moore - on PTO until June 27 [:pete][:pmoore] from comment #50) 
> - avoid dependency on the "my laptop" environment
> - flaky WiFi
> - formalized merge procedure
> - can be run by sheriffs in case of emergency
> 
> We may need to figure out how to update the maintenance page.

My thought here is that we would run the reconfig automatically after the other jenkins tests pass. We'd need some checks on this...maybe wait 30min to see if other changes land, and check for other reconfigs already in progress (per-master reconfig lockfiles may help us here).
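
A rough sketch of those two checks, under the assumption that we know when the last change landed and whether a lockfile exists. The function and parameter names are invented for illustration and are not from any existing script.

```python
from datetime import datetime, timedelta

# Assumption: the "maybe wait 30min" quiet period mentioned above.
QUIET_PERIOD = timedelta(minutes=30)


def ready_to_reconfig(last_landing, now, reconfig_in_progress,
                      quiet_period=QUIET_PERIOD):
    """Reconfig only if nothing is running and no change landed recently."""
    if reconfig_in_progress:
        return False
    return now - last_landing >= quiet_period


# Made-up timestamps for illustration.
landed = datetime(2014, 6, 1, 12, 0)
print(ready_to_reconfig(landed, datetime(2014, 6, 1, 12, 10), False))  # too soon
print(ready_to_reconfig(landed, datetime(2014, 6, 1, 12, 45), False))  # quiet long enough
print(ready_to_reconfig(landed, datetime(2014, 6, 1, 12, 45), True))   # lock held
```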
(In reply to Pete Moore - on PTO until June 27 [:pete][:pmoore] from comment #50) 
> We may need to figure out how to update the maintenance page.

We should be able to setup a generic account for this. We also need to update the individual bugs too.
used the script today. twas pretty neat. One hiccup:

the script failed on me the first run through during the master 'update' step on manage_foopies. Although it saw I still had pending things to do on my second run, I lost my 'preview_changes' state of the repos and the wiki didn't think there was anything to update.

maybe we should put the preview_changes files inside a 'pending_changes' dir. we use that dir to still see if we have an 'incomplete' reconfig but then also use the files in it to update the wiki with?

What do you think? Again, script is awesome, so much tender love and care in it :)
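
To make the 'pending_changes' idea concrete, here is a small Python sketch. It is only an illustration of the suggestion, not how the script works today; the file names and helper functions are hypothetical. Preview files are moved into a pending directory before the risky steps and only cleared once the wiki update has succeeded, so a failed run leaves them in place for the retry.

```python
import os
import shutil
import tempfile


def stash_previews(preview_files, pending_dir):
    """Move preview files into pending_dir so a failed run cannot lose them."""
    os.makedirs(pending_dir, exist_ok=True)
    for path in preview_files:
        shutil.move(path, os.path.join(pending_dir, os.path.basename(path)))


def pending_previews(pending_dir):
    """List stashed previews; a non-empty list means an incomplete reconfig."""
    if not os.path.isdir(pending_dir):
        return []
    return sorted(os.listdir(pending_dir))


def clear_pending(pending_dir):
    """Call only after the reconfig and the wiki update both succeeded."""
    shutil.rmtree(pending_dir, ignore_errors=True)


def demo():
    with tempfile.TemporaryDirectory() as work:
        # Hypothetical preview file name with made-up content.
        preview = os.path.join(work, "buildbot-configs_preview_changes.txt")
        with open(preview, "w") as handle:
            handle.write("made-up changeset summary\n")
        pending_dir = os.path.join(work, "pending_changes")

        before = pending_previews(pending_dir)
        stash_previews([preview], pending_dir)
        # Pretend the first run died here: the stash survives for the retry.
        after_failure = pending_previews(pending_dir)
        clear_pending(pending_dir)
        after_success = pending_previews(pending_dir)
        return before, after_failure, after_success


print(demo())
```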
QA Contact: armenzg → bugspam.Callek
at this point in development this bug is more a "Tool" than a "Buildduty Task"
Component: Buildduty → Tools
QA Contact: bugspam.Callek → hwine
Status: ASSIGNED → NEW
See Also: → 1040013
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2276]
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2276] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2284]
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2284] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2289]
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2289] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2284]
Assignee: pmoore → nobody
I have a script that has been updating my dev-master for the past week, running from a cron job. Right now I'm looking into how I could deploy this using puppet, most likely via buildmaster-cron.erb.
Assignee: nobody → coop
Status: NEW → ASSIGNED
I was thinking yesterday that it would be nice to have a new fabric/ansible action that could spit out the last reconfig event for each master from the reconfig.log. I think I'll want to change the log structure to make it easier to match start/end state for a single reconfig event.
This script (or some version thereof) has been running in cron on dev-master2 since late March, and has automatically kept my build-master up-to-date. Full output logs are in /builds/buildbot/coop/build-master (reconfig.log), but here's an excerpt:

2015-04-06 16:00:01 - INFO  - Checking whether we need to reconfig...
2015-04-06 16:00:03 - INFO  - buildbotcustom: production-0.8 tag has moved - old rev: 6f6da6574e4467bb09e769f9b02f156e7cdee50f; new rev: e0a257bbdc725612b8376618f61ac4b474a3de17
2 files updated, 0 files merged, 0 files removed, 0 files unresolved
2015-04-06 16:00:05 - INFO  - buildbot-configs: production tag has moved - old rev: 8de75b38051abf46d85676a0d7c16284eeaeb7bd; new rev: 94284d353ec40d323b0f77bbbfb166b66958cf0f
1 files updated, 0 files merged, 0 files removed, 0 files unresolved
2015-04-06 16:00:07 - INFO  - tools: default tag is unchanged - rev: 4311a7b8259f7fe41d077d0708692cba96e2e95f
2015-04-06 16:00:07 - INFO  - Starting reconfig
2015-04-06 16:00:15 - INFO  - Reconfig completed successfuly.
2015-04-06 16:00:15 - INFO  - Elapsed: 0h 0m 14s
2015-04-06 16:00:15 - INFO  - ==================================================
2015-04-06 17:00:01 - INFO  - Checking whether we need to reconfig...
2015-04-06 17:00:03 - INFO  - buildbotcustom: production-0.8 tag is unchanged - rev: e0a257bbdc725612b8376618f61ac4b474a3de17
2015-04-06 17:00:04 - INFO  - buildbot-configs: production tag is unchanged - rev: 94284d353ec40d323b0f77bbbfb166b66958cf0f
2015-04-06 17:00:05 - INFO  - tools: default tag is unchanged - rev: 4311a7b8259f7fe41d077d0708692cba96e2e95f
2015-04-06 17:00:05 - INFO  - No reconfig required.
2015-04-06 17:00:05 - INFO  - Elapsed: 0h 0m 4s
2015-04-06 17:00:05 - INFO  - ==================================================

The reconfig details still end up in the twistd.log as usual.
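
The reconfig.log excerpt above is regular enough that pulling out the last event per master should be easy. A minimal Python sketch, assuming each run ends with the `=====` separator line as in the excerpt; the function names are made up, and the sample is a trimmed copy of the lines above.

```python
import re

# Matches lines such as "2015-04-06 16:00:01 - INFO  - Starting reconfig".
LINE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) - (\w+)\s+- (.*)$")

SAMPLE_LOG = """\
2015-04-06 16:00:01 - INFO  - Checking whether we need to reconfig...
2 files updated, 0 files merged, 0 files removed, 0 files unresolved
2015-04-06 16:00:07 - INFO  - Starting reconfig
2015-04-06 16:00:15 - INFO  - Reconfig completed successfuly.
2015-04-06 16:00:15 - INFO  - ==================================================
2015-04-06 17:00:01 - INFO  - Checking whether we need to reconfig...
2015-04-06 17:00:05 - INFO  - No reconfig required.
2015-04-06 17:00:05 - INFO  - ==================================================
"""


def split_events(log_text):
    """Group timestamped lines into runs, using the '====' line as a delimiter."""
    events, current = [], []
    for line in log_text.splitlines():
        match = LINE_RE.match(line)
        if match is None:
            continue  # untimestamped lines, e.g. raw hg update output
        stamp, _level, message = match.groups()
        if message and set(message) == {"="}:
            if current:
                events.append(current)
            current = []
        else:
            current.append((stamp, message))
    if current:
        events.append(current)  # a run with no closing separator yet
    return events


def summarize(event):
    """Return (start timestamp, outcome) for one run."""
    messages = [message for _stamp, message in event]
    if "Starting reconfig" in messages:
        outcome = "reconfig started"
    elif "No reconfig required." in messages:
        outcome = "no reconfig required"
    else:
        outcome = "unknown"
    return event[0][0], outcome


events = split_events(SAMPLE_LOG)
print(summarize(events[-1]))
```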

The script itself is pretty simple:
* check for production tag changes on buildbot-configs and buildbotcustom
* kickoff a reconfig via buildbot_wrangler if the tag has moved
* update tools repo as well
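
The decision in those steps can be sketched in a few lines of Python. This is an illustration only (the real script is shell), and it assumes that a moved tools revision is updated but does not by itself trigger a reconfig. The shortened revisions are taken from the log excerpts in this bug.

```python
# Assumption: only these two repos trigger a reconfig; tools is just updated.
TRIGGER_REPOS = ("buildbot-configs", "buildbotcustom")


def moved(old_revs, new_revs):
    """Names of repos whose tracked revision changed, sorted for stable output."""
    return sorted(repo for repo in new_revs if old_revs.get(repo) != new_revs[repo])


def reconfig_needed(old_revs, new_revs):
    """True if any repo that warrants a reconfig has moved."""
    return any(repo in TRIGGER_REPOS for repo in moved(old_revs, new_revs))


before_16 = {
    "buildbotcustom": "6f6da6574e44",
    "buildbot-configs": "8de75b38051a",
    "tools": "4311a7b8259f",
}
after_16 = {
    "buildbotcustom": "e0a257bbdc72",
    "buildbot-configs": "94284d353ec4",
    "tools": "4311a7b8259f",
}
# Hypothetical case: only the tools repo moves.
tools_only = dict(after_16, tools="3dce4733eee8")

print(moved(before_16, after_16), reconfig_needed(before_16, after_16))
print(moved(after_16, after_16), reconfig_needed(after_16, after_16))
print(moved(after_16, tools_only), reconfig_needed(after_16, tools_only))
```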

Logging may be a point of contention. Reconfigs are our bread-and-butter and as such I think they warrant their own data stream, so I opted not to re-use an existing log source (e.g. syslog).

Puppet patch for deployment coming up.
Attachment #8589126 - Flags: review?(jlund)
Comment on attachment 8589126 [details] [diff] [review]
[tools] Script to kick off a reconfig if production tag moves on buildbot-configs or buildbotcustom

Review of attachment 8589126 [details] [diff] [review]:
-----------------------------------------------------------------

My only extra desire here would be to have a check against travis before running reconf:

either using https://github.com/travis-ci/travis.rb
or some other method against http://docs.travis-ci.com/api/#branches

e.g. https://api.travis-ci.org/repos/mozilla/build-buildbotcustom/branches/production-0.8 and https://api.travis-ci.org/repos/mozilla/build-buildbot-configs/branches/production
Comment on attachment 8589126 [details] [diff] [review]
[tools] Script to kick off a reconfig if production tag moves on buildbot-configs or buildbotcustom

Review of attachment 8589126 [details] [diff] [review]:
-----------------------------------------------------------------

wow, this is pretty cool. 

some thoughts:
1) agree with callek that checking with travis health would be pretty neat.
2) so no more checkconfig at all?
3) what happens if production tag on one of the buildbot repos is updated before cron runs this and a required accompanying patch in the other buildbot repo doesn't land on prod till after? Do we need to give us a 5min or 10m window to land all our patches? Is the risk too small to worry about?
4) how will maintenance wiki page play a part?
5) what about when ffxbld adds a no-op production tag? Would it be taxing to add more reconfigs when we don't need them?
   a) e.g. http://hg.mozilla.org/build/buildbot-configs/rev/82047ee2301f
     # from your reconfig log
     2015-04-06 18:00:01 - INFO  - Checking whether we need to reconfig...
     2015-04-06 18:00:03 - INFO  - buildbotcustom: production-0.8 tag has moved - old rev: e0a257bbdc725612b8376618f61ac4b474a3de17; new rev: 555f7b968f07ac5337644052ceec950ebb4d1ad0
     1 files updated, 0 files merged, 0 files removed, 0 files unresolved
     2015-04-06 18:00:05 - INFO  - buildbot-configs: production tag has moved - old rev: 94284d353ec40d323b0f77bbbfb166b66958cf0f; new rev: 82047ee2301fec565e1fd29fdf9f2edc6e2c0f07
     5 files updated, 0 files merged, 0 files removed, 0 files unresolved
     2015-04-06 18:00:07 - INFO  - tools: default tag has moved - old rev: 4311a7b8259f7fe41d077d0708692cba96e2e95f; new rev: 3dce4733eee837bf9b7a22564011fb11332fc618
     1 files updated, 0 files merged, 0 files removed, 0 files unresolved
    2015-04-06 18:00:08 - INFO  - Starting reconfig

::: buildfarm/maintenance/maybe_reconfig.sh
@@ +146,5 @@
> +fi
> +source bin/activate
> +
> +if reconfig_needed; then
> +    # We append the START_TIME to the reconfig milestone messges to make it easier to match up milestones

messages?

::: buildfarm/maintenance/reconfig-logrotate.conf
@@ +1,1 @@
> +/builds/buildbot/coop/*/reconfig.log {

is this staying in your dir?
(In reply to Jordan Lund (:jlund) from comment #62)
> some thoughts:
> 1) agree with callek that checking with travis health would be pretty neat.

I agree, but is this the right place to do it?

We have a problem right now with travis throughput. Having all the masters loop while waiting on travis for however long (sometimes over an hour at present) doesn't strike me as the most efficient workflow. I'd rather have the travis checks done as part of moving the production tag in the first place, i.e. the production tag shouldn't move *unless* the tests pass on that revision.

> 2) so no more checkconfig at all?

How much do we trust our travis tests? ;) You're right, I should add it back in.

> 3) what happens if production tag on one of the buildbot repos is updated
> before cron runs this and a required accompanying patch in the other
> buildbot repo doesn't land on prod till after? Do we need to give us a 5min
> or 10m window to land all our patches? Is the risk too small to worry about?

You're making a strong argument for unifying buildbot-configs and buildbotcustom here. A race condition is inevitable while those two repos are distinct.

> 4) how will maintenance wiki page play a part?

It won't. That is, I think we need a better tool. The Maintenance page is a poor-man's summation of the changes you can see from the various hg logs anyway.

Your slave_health v2, for example, could give reports of deltas between production tags and/or lists of what changed across repos for a given time window.

> 5) what about when ffxbld adds a no-op production tag? Would it be taxing to
> add more reconfigs when we don't need them?
>    a) e.g. http://hg.mozilla.org/build/buildbot-configs/rev/82047ee2301f
>      # from your reconfig log
>      2015-04-06 18:00:01 - INFO  - Checking whether we need to reconfig...
>      2015-04-06 18:00:03 - INFO  - buildbotcustom: production-0.8 tag has
> moved - old rev: e0a257bbdc725612b8376618f61ac4b474a3de17; new rev:
> 555f7b968f07ac5337644052ceec950ebb4d1ad0
>      1 files updated, 0 files merged, 0 files removed, 0 files unresolved
>      2015-04-06 18:00:05 - INFO  - buildbot-configs: production tag has
> moved - old rev: 94284d353ec40d323b0f77bbbfb166b66958cf0f; new rev:
> 82047ee2301fec565e1fd29fdf9f2edc6e2c0f07
>      5 files updated, 0 files merged, 0 files removed, 0 files unresolved
>      2015-04-06 18:00:07 - INFO  - tools: default tag has moved - old rev:
> 4311a7b8259f7fe41d077d0708692cba96e2e95f; new rev:
> 3dce4733eee837bf9b7a22564011fb11332fc618
>      1 files updated, 0 files merged, 0 files removed, 0 files unresolved
>     2015-04-06 18:00:08 - INFO  - Starting reconfig

Are you talking about when release-runner initiates a reconfig? In that case, the maybe_reconfig script will find the production tags already moved and won't do anything, except on masters (i.e. tests masters) that don't currently get reconfig-ed automatically for a release.

It would be great if *all* the scripts that can reconfig a master started checking for the lock file before doing so.
 
> > +    # We append the START_TIME to the reconfig milestone messges to make it easier to match up milestones
> 
> messages?

Fixed.

> ::: buildfarm/maintenance/reconfig-logrotate.conf
> @@ +1,1 @@
> > +/builds/buildbot/coop/*/reconfig.log {

This changes to /builds/buildbot/*/reconfig.log for actual deployment.
Attachment #8589126 - Attachment is obsolete: true
Attachment #8589126 - Flags: review?(jlund)
thanks for the reply! all sounds good.

(In reply to Chris Cooper [:coop] from comment #63)
> (In reply to Jordan Lund (:jlund) from comment #62)
> > some thoughts:
> > 1) agree with callek that checking with travis health would be pretty neat.
> 
> I agree, but is this the right place to do it?
> 

we discussed over vidyo and I'm on board with this not being the place to do this.

> > 2) so no more checkconfig at all?
> 
> How much do we trust our travis tests? ;) You're right, I should add it back
> in.

I suppose it doesn't hurt :)

> 
> > 3) what happens if production tag on one of the buildbot repos is updated
> > before cron runs this and a required accompanying patch in the other
> > buildbot repo doesn't land on prod till after? Do we need to give us a 5min
> > or 10m window to land all our patches? Is the risk too small to worry about?
> 
> You're making a strong argument for unifying buildbot-configs and
> buildbotcustom here. A race condition is inevitable while those two repos
> are distinct.

unified would be great. maybe a 5min window would be nice as it usually takes a couple min to merge both if we do it by hand. (iiuc travis has a delay between landing).
> 
> > 4) how will maintenance wiki page play a part?
> 
> It won't. That is, I think we need a better tool. The Maintenance page is a
> poor-man's summation of the changes you can see from the various hg logs
> anyway.

gotcha. makes sense

> > 5) what about when ffxbld adds a no-op production tag? Would it be taxing to
> > add more reconfigs when we don't need them?

> Are you talking about when release-runner initiates a reconfig?

ya, I suppose this isn't a big deal.
Attachment #8589128 - Flags: review?(jlund) → review+
> (In reply to Chris Cooper [:coop] from comment #63)
> > (In reply to Jordan Lund (:jlund) from comment #62)
> > > some thoughts:
> > > 1) agree with callek that checking with travis health would be pretty neat.
> > 
> > I agree, but is this the right place to do it?
> > 
> 
> we discussed over vidyo and I'm on board with this not being the place to do
> this.

To be specific, we decided that the merge script is the correct place to do this check. We should *not* be merging to production if the builds are failing tests.

I'm working on a python script now to check the travis status of the repo via the api.
Fixed nits and created a library for common shell functions as suggested in our 1x1.

Also updated the end_to_end_reconfig.sh script to use the new library.
Attachment #8590344 - Flags: review?(jlund)
Attachment #8590344 - Attachment description: bug978928_automate_reconfigs_v2.diff[tools] Script to kick off a reconfig if production tag moves on buildbot-configs or buildbotcustom, v2 → [tools] Script to kick off a reconfig if production tag moves on buildbot-configs or buildbotcustom, v2
Comment on attachment 8590349 [details] [diff] [review]
[tools] Script to kick off a reconfig if production tag moves on buildbot-configs or buildbotcustom, v2bug978928_automate_reconfigs_v2.diff

Review of attachment 8590349 [details] [diff] [review]:
-----------------------------------------------------------------

::: buildfarm/maintenance/maybe_reconfig.sh
@@ +67,5 @@
> +# Include our shared function library
> +. ${MASTER_DIR}/tools/lib/shell/functions
> +
> +# Check to see if a reconfig is already in progress. Bail if one is.
> +LOCKFILE=${MASTER_DIR}/reconfig_in_progress.lock

NIT: can you please update the fabric script to use this lock when doing its own reconfig as well, so it doesn't conflict with this?

Otherwise ship-it clicks could cause simultaneous reconfigs (well, they can now too, but with humans currently watching directly when reconfigs happen it's easier and quicker to recover than if it happens between two independent automatic things). And it's actually more likely to conflict; consider the following example:

* ship-it gets pushed at 00:50
* production tags move and get pushed no later than 00:55
* ship-it starts reconfig (builders+schedulers) at 00:56
* this script starts reconfig at 01:00
* we have two reconfigs in progress at once on a bunch of masters
   \- This has been a source of breaking-buildbot many times in the past for us, once a reconfig starts you have to wait for it to finish :-) [or hard-abort it by killing the whole master process]
> * production tags move and get pushed no later than 00:55
> * ship-it starts reconfig (builders+schedulers) at 00:56
> * this script starts reconfig at 01:00

re earlier comments, wouldn't the prod tags be bumped already, so this script would only start a reconfig if there was a new push between 00:56 and 01:00? Though, granted, it would be better to remove any chance of overlap.
Attachment #8590349 - Flags: review?(jlund) → review+
(In reply to Jordan Lund (:jlund) from comment #69)
> > * production tags move and get pushed no later than 00:55
> > * ship-it starts reconfig (builders+schedulers) at 00:56
> > * this script starts reconfig at 01:00
> 
> re earlier comments, wouldn't the prod tags be bumped already and so this
> script would only start a reconfig if there was a new push between 00:56 and
> 01:00. Though, granted, it would be better to remove any chance of overlap.

well, you might be right... *sorta*. The script looks at the local machine, and if the prod tags move *on that machine* it does a reconfig. So if we munge the timeline a bit and say the machine does its reconfig first, THEN ship-it triggers one, we can still hit that issue, since ship-it's reconfig code doesn't care whether the prod tags actually moved during "update". Thus the lock file there would also protect this script from external reconfigs.
The bad news is that there's no easy way to reconcile a travis job with an hg commit. The best I've been able to do is match commit messages from travis build output to hg, but this is not foolproof because sometimes the same commit message is re-used, especially for bugs with multi-part patches. I think this will be problematic until/if we use github as the RoR. I _could_ simply check that the most recent job on the master branch passed, which is admittedly more than we do right now.
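
To show where message matching breaks down, here is a small Python sketch with made-up revisions and bug numbers (none of this is real data, and the helper names are hypothetical). A message only identifies a changeset when exactly one hg commit carries it; multi-part patches that reuse a message come back as ambiguous.

```python
from collections import defaultdict

# Made-up (revision, commit message) pairs standing in for `hg log` output.
HG_COMMITS = [
    ("a1b2c3", "Bug 1000001 - Add new builder"),
    ("d4e5f6", "Bug 1000002 - Multi-part patch"),
    ("0a1b2c", "Bug 1000002 - Multi-part patch"),
]


def index_by_message(hg_commits):
    """Map each commit message to every revision that uses it."""
    index = defaultdict(list)
    for rev, message in hg_commits:
        index[message.strip()].append(rev)
    return index


def match_rev(travis_message, index):
    """Return the hg revision for a message, or None if missing or ambiguous."""
    revs = index.get(travis_message.strip(), [])
    return revs[0] if len(revs) == 1 else None


index = index_by_message(HG_COMMITS)
print(match_rev("Bug 1000001 - Add new builder", index))   # unique match
print(match_rev("Bug 1000002 - Multi-part patch", index))  # ambiguous
```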

How do we want to handle this? Should I continue to try to integrate travis sanity checking, or should we rely on buildduty to do this prior to initiating the reconfig?

Remaining steps AFAICT:
* update end_to_end_reconfig.sh script to not run manage_masters by default. Still need this ability for emergencies.
** ENHANCEMENT: check travis prior to moving production tags (see above)
* update manage_masters.py to respect the new lockfile, and create it when required
* land all the things
Flags: needinfo?(mgervasini)
Flags: needinfo?(jlund)
Flags: needinfo?(bugspam.Callek)
(In reply to Chris Cooper [:coop] from comment #71)
> How do we want to handle this? Should I continue to try to integrate travis
> sanity checking, or should we rely on buildduty to do this prior to
> initiating the reconfig?


I'm OK with manual checking of travis results in the interim. It's at least no worse than today.
Flags: needinfo?(bugspam.Callek)
(In reply to Chris Cooper [:coop] from comment #71)
> The bad news is that there's no easy way to reconcile a travis job with an
> hg commit. The best I've been able to do is match commit messages from
> travis build output to hg, but this is not foolproof because sometimes the
> same commit message is re-used, especially for bugs with multi-part patches.
> I think this will be problematic until/if we use github as the RoR. I
> _could_ simply check that the most recent job on the master branch passed,
> which is admittedly more than we do right now.

See https://github.com/mozilla/build-mozharness#to-match-commits-to-upstream-hg-changesets

This principle applies to all github.com/mozilla/build-* repos.
Flags: needinfo?(coop)
(In reply to Pete Moore [:pmoore][:pete] from comment #74)
> Another alternative is:
> https://wiki.mozilla.org/ReleaseEngineering/Applications/
> Mapper#Returns_a_mapping_pair

Thanks, Pete!

We need the hg clone to move the tag, but I was trying to avoid having to also clone the git repo just to check the travis status. The mapper service looks like the way to go.

Given comment #72, I'll land the other changes (once reviewed) so that reconfigs can start happening automatically, and then add in the travis check after I get it working.
Flags: needinfo?(mgervasini)
Flags: needinfo?(jlund)
Flags: needinfo?(coop)
This patch to end_to_end_reconfig.sh performs the merges as per normal, but only reconfigs if FORCE_RECONFIG is set.

I also fixed an error that I found while testing. The script now compares merge output for each repo against an empty merge file so we stop running reconfigs when nothing has really changed.

Patch assumes that attachment #8590349 [details] [diff] [review] is already applied.
Attachment #8593693 - Flags: review?(bugspam.Callek)
I have a patch for the fabric actions.py to respect the new lockfile, but it's getting late and I want to test it properly and also add an action to remove the lockfile, so I'll leave it for tomorrow.
(In reply to Chris Cooper [:coop] from comment #77)
> I have a patch for the fabric actions.py to respect the new lockfile, but
> it's getting late and I want to test it properly and also add an action to
> remove the lockfile, so I'll leave it for tomorrow.

...and here it is.
Attachment #8594054 - Flags: review?(bugspam.Callek)
Comment on attachment 8594054 [details] [diff] [review]
[tools] Make manage_masters.py respect the reconfig lockfile, and add new actions to manipulate the lockfile

Review of attachment 8594054 [details] [diff] [review]:
-----------------------------------------------------------------

On the high level I expected the lockfile here to be *created* by the reconfig too, not just test for it.

The idea for having it create the lockfile in a normal reconfig is so that e.g. release runner doesn't get stomped on by the automated reconfig. E.g. release runner starting the reconfig then the cron trying to reconfig on top (and vice versa).

Especially consider the following:
* Release runner starts a reconfig 
   \- does merge
    |- starts reconfig testing for lockfile (no lockfile present)
    |- reconfig never created lockfile
* ~5 minutes pass, reconfig still in progress
* someone lands change to production for a tree bustage
* The automated reconfig code sees that change and tries to reconfig
* we now have two reconfigs in parallel, which is known to break stuff.
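
A sketch of the behaviour I'd expect, written as a standalone Python context manager rather than as the actual fabric action. The lockfile name and exception class are assumptions for illustration. The reconfig itself creates the lock on entry, refuses to start if one exists, and removes it on exit even if the reconfig fails.

```python
import os
import tempfile
from contextlib import contextmanager

RECONFIG_LOCKFILE = "reconfig.lock"  # assumed name, for illustration


class ReconfigInProgress(Exception):
    """Raised when another reconfig already holds the lock."""


@contextmanager
def reconfig_lock(master_dir):
    path = os.path.join(master_dir, RECONFIG_LOCKFILE)
    try:
        # Test for and create the lockfile in one atomic step.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        raise ReconfigInProgress(path)
    os.close(fd)
    try:
        yield path
    finally:
        os.remove(path)  # always release, even if the reconfig raised


def run_demo():
    """Replay the scenario above against a temporary master directory."""
    events = []
    with tempfile.TemporaryDirectory() as master_dir:
        with reconfig_lock(master_dir):
            events.append("release-runner reconfig started")
            try:
                # The cron job fires while the first reconfig is still running.
                with reconfig_lock(master_dir):
                    events.append("cron reconfig started")
            except ReconfigInProgress:
                events.append("cron reconfig skipped")
        # Next cron run, after the first reconfig released the lock.
        with reconfig_lock(master_dir):
            events.append("cron reconfig started later")
    return events


print(run_demo())
```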
Attachment #8594054 - Flags: review?(bugspam.Callek) → review-
Attachment #8593693 - Flags: review?(bugspam.Callek) → review+
(In reply to Justin Wood (:Callek) from comment #79)
> On the high level I expected the lockfile here to be *created* by the
> reconfig too, not just test for it.

Yep, good catch. Just me being brain-dead late last night.

This won't help the case where someone is manually running a reconfig on a single master, but hopefully that's an isolated enough use case that they can either use the fabric tools or manage the lockfile by hand. I simplified the lockfile name to make that easier.
Attachment #8594054 - Attachment is obsolete: true
Attachment #8594158 - Flags: review?(bugspam.Callek)
Attachment #8594158 - Attachment description: Make manage_masters.py respect the reconfig lockfile, and add new actions to manipulate the lockfile, v2 → [tools] Make manage_masters.py respect the reconfig lockfile, and add new actions to manipulate the lockfile, v2
(In reply to Chris Cooper [:coop] from comment #80)
> This won't help the case where someone is manually running a reconfig on a
> single master, but hopefully that's an isolated enough use case that they
> can either use the fabric tools or manage the lockfile by hand. I simplified
> the lockfile name to make that easier.

Ping re: outstanding review request. 

I want to get this deployed *before* Selena takes over buildduty on Friday. It will also save you some headaches in the interim if it gets deployed before that.
Comment on attachment 8594158 [details] [diff] [review]
[tools] Make manage_masters.py respect the reconfig lockfile, and add new actions to manipulate the lockfile, v2

Kim: can you review here? Callek is out for the day and I'd like to deploy this ASAP next week to make buildduty slightly easier for Selena.

Revised patch addresses Callek's concerns from a previous iteration, namely creating and respecting the lockfile from within the fabric actions.
Attachment #8594158 - Flags: review?(bugspam.Callek) → review?(kmoir)
Comment on attachment 8594158 [details] [diff] [review]
[tools] Make manage_masters.py respect the reconfig lockfile, and add new actions to manipulate the lockfile, v2

lgtm
Attachment #8594158 - Flags: review?(kmoir) → review+
Comment on attachment 8590349 [details] [diff] [review]
[tools] Script to kick off a reconfig if production tag moves on buildbot-configs or buildbotcustom, v2bug978928_automate_reconfigs_v2.diff

Review of attachment 8590349 [details] [diff] [review]:
-----------------------------------------------------------------

https://hg.mozilla.org/build/tools/rev/d3d85ffcce66
Attachment #8590349 - Flags: checked-in+
Comment on attachment 8593693 [details] [diff] [review]
[tools] Don't kick off a reconfig after merging by default

Review of attachment 8593693 [details] [diff] [review]:
-----------------------------------------------------------------

https://hg.mozilla.org/build/tools/rev/402e043b0e35
Attachment #8593693 - Flags: checked-in+
Comment on attachment 8594158 [details] [diff] [review]
[tools] Make manage_masters.py respect the reconfig lockfile, and add new actions to manipulate the lockfile, v2

Review of attachment 8594158 [details] [diff] [review]:
-----------------------------------------------------------------

https://hg.mozilla.org/build/tools/rev/9c956d58c48b

::: lib/python/util/fabric/actions.py
@@ +13,5 @@
>  
>  OK = green('[OK]')
>  FAIL = red('[FAIL]')
>  
> +RECONFIG_LOCKFILE = 'reconfig.lock'

I've updated the maybe_reconfig script to use the same, shorter lockfile name.
Attachment #8594158 - Flags: checked-in+
I'll land the puppet changes first thing tomorrow EST so I can watch it.
Comment on attachment 8589128 [details] [diff] [review]
[puppet] Hourly crontask to check whether reconfig is needed

Review of attachment 8589128 [details] [diff] [review]:
-----------------------------------------------------------------

https://hg.mozilla.org/build/puppet/rev/ece27b86626c
Attachment #8589128 - Flags: checked-in+
I've checked masters of each type, and they're all checking whether they need to reconfig every hour now. Cron spam was fixed in bug 1161571.
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2284]
Awesome work Coop! =)
Component: Tools → General