Closed Bug 990173 Opened 10 years ago Closed 8 years ago

Move b2g bumper to a dedicated host

Categories

(Release Engineering :: General, defect)

Platform: x86 Linux
Type: defect
Priority: Not set
Severity: major

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: jhopkins, Unassigned)

References

Details

Attachments

(1 file)

b2g bumper is currently running on buildbot-master66 which is being deprecated (bug 990172).

Let's move b2g bumper to a dedicated host and call it git-hg-bumpers or similar.
Hal: did this already get moved? I see a bunch of git commands running on bm66 dating from May 19, but nothing else.
Flags: needinfo?(hwine)
Not to my knowledge. Puppet still believes b2g_bumper lives on bm66.
Flags: needinfo?(hwine)
buildbot-master66 has been complaining about load all day today. I think it's time to find a beefier instance for this, or split the processes across multiple instances.
Severity: normal → major
Attached image Trend for load
The b2g bumper runs every 5 minutes, and takes about 3 minutes to run. On 10/31 it took 2:30 - 2:45 to run. Nagios alerts with WARNING when any of the short/mid/longterm load are over 10 (CRITICAL is 25). We're getting alerts because the long-term load is just about always over 10 (see inset graph). 

The longer-term trend comes from graphite; its values are quite different for the short-term load, but the long-term load looks reasonably similar. Something changed late on 11/04 and again on 11/08. I can't see any big changes in the manifests then, so I'm going to guess something external is making all the 'git ls-remote' operations we do slower. So we should either split the work or relax the nagios threshold.

I've downtimed the alert for 12 hours.
I've noticed that we are not caching the results of git ls-remote across branches - which means we are creating far more load than we need to. I would propose resolving this first, as with 5 branches we might be able to reduce load by e.g. 70%-80% (depending on how much commonality exists between the manifests).

I will take a look at this.
I created a very crude script to run the branches in sequence, and ran it locally.

The first branch took approximately 20 mins to run. The other branches took approximately 12 seconds to run, when sharing a cache across branches. In other words, it looks like with 5 branches, when sharing a cache, we should hit about 20% of the resource usage we were previously hitting.

The very crude script for testing was:

#!/usr/bin/env python
# Very crude test harness: run B2GBumper once per config file, re-using the
# same in-memory git ref cache so only the first branch pays the full cost
# of the git ls-remote calls.

import sys
import copy
from b2g_bumper import B2GBumper

if __name__ == '__main__':

    git_ref_cache = {}
    # This wrapper is named <script>_combined.py; strip the suffix to recover
    # the real b2g_bumper script name for sys.argv[0].
    b2g_bumper_script = sys.argv[0].replace('_combined', '')
    config_files = copy.copy(sys.argv[1:])
    for config in config_files:
        # Fake the command line that mozharness expects for a single branch.
        sys.argv = [b2g_bumper_script, '-c', config, '-c', '/Users/pmoore/work/b2g_bumper/pete.py', '--checkout-manifests', '--massage-manifests']
        bumper = B2GBumper()
        # Seed this run with the cache accumulated by the previous branches.
        bumper._git_ref_cache = git_ref_cache
        print bumper._git_ref_cache
        bumper.run()
        print bumper._git_ref_cache

This essentially cached the bumper._git_ref_cache between runs.

I'll implement this differently for this bug, since the approach above messed up mozharness logging. Instead I'll write out the cache as a json file, and have a script which runs through the branches in sequence, loading the cache at the beginning of each run and storing it at the end.

Each time the cron job runs, it will start with an empty cache, so the cache is only shared across branches within a single cron run; the next run starts with a fresh cache.
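Something like the wrapper below is what I have in mind - a rough sketch only; the script path, cache file location, and the option for passing the cache file are placeholders, not existing mozharness options:

#!/usr/bin/env python
# Rough sketch: run the bumper once per branch config, sharing a git ref
# cache between branches via a throwaway json file.
import os
import subprocess
import sys

CACHE_FILE = '/tmp/b2g_bumper_git_ref_cache.json'  # placeholder path

if __name__ == '__main__':
    # Start each cron invocation with an empty cache, so refs are only
    # shared across branches within a single run, never across runs.
    if os.path.exists(CACHE_FILE):
        os.remove(CACHE_FILE)
    for config in sys.argv[1:]:
        # The bumper would load the cache file (if present) before querying
        # git, and write it back out when it finishes.
        subprocess.check_call([
            'python', 'scripts/b2g_bumper.py',  # placeholder script path
            '-c', config,
            '--git-ref-cache', CACHE_FILE,      # hypothetical option name
        ])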

Hope this makes sense!
I've written the code to clear/import/export the cache and tested it locally, and it is working. However, I believe the parallelisation is not working properly in b2g bumper, so I want to fix this too.

I've also noticed that we query for a single revision when using git ls-remote, but of the 1017 distinct (url, head/tag) requests we make, there are only 302 unique urls - i.e. on average each git url is queried 3.36 times for different tags/heads. I believe it will be much more efficient to query each url only once and get back all of its heads/tags.
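For illustration, here's roughly what querying all refs in one go and answering later lookups from memory looks like (the function names are mine, not the existing bumper code):

import subprocess

def all_refs(url):
    """Run a single 'git ls-remote <url>' and return {ref_name: sha}."""
    output = subprocess.check_output(['git', 'ls-remote', url])
    refs = {}
    for line in output.splitlines():
        sha, _, name = line.partition('\t')
        refs[name] = sha
    return refs

# Populate the cache once per url, then resolve the ~3.36 lookups per url
# from memory instead of issuing a separate ls-remote per head/tag.
cache = {}

def lookup(url, ref):
    if url not in cache:
        cache[url] = all_refs(url)
    return cache[url].get('refs/heads/%s' % ref) or cache[url].get('refs/tags/%s' % ref)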

Another point that occurs to me: all this data we collect and hammer git.m.o for, we already have in vcs sync. We know the state of all the heads/tags because only vcs sync pushes to git.m.o - so if we simply stored the state of the heads/tags in vcs sync when pushing, we would not need to keep querying git.m.o. At the moment vcs sync polls the source repos looking for changes and only pushes to git.m.o when there are changes, but with b2g bumper we are now also polling git.m.o looking for changes.
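As a rough illustration of that idea (the state file path and function are hypothetical, not existing vcs sync code), the push step could snapshot the refs it has just pushed for the bumper to read, instead of the bumper running ls-remote against git.m.o:

import json
import os
import subprocess

STATE_FILE = '/builds/vcs_sync/pushed_refs.json'  # hypothetical location

def record_pushed_refs(local_repo, remote_url):
    # After a successful push the local mirror's refs match git.m.o, so a
    # local 'git show-ref' tells us the remote state with no network I/O.
    output = subprocess.check_output(['git', 'show-ref'], cwd=local_repo)
    refs = {}
    for line in output.splitlines():
        sha, name = line.split(' ', 1)
        refs[name] = sha
    state = {}
    if os.path.exists(STATE_FILE):
        state = json.load(open(STATE_FILE))
    state[remote_url] = refs
    with open(STATE_FILE, 'w') as f:
        json.dump(state, f)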

Lastly, I don't think the parallelisation is working as expected: a b2g bumper run averages a couple of seconds per git ls-remote command when parallelised, and it takes just as long without parallelisation.
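As a sanity check, a standalone test of running ls-remote through a worker pool (nothing bumper-specific, just the pattern I'd expect, with a placeholder url list) is below - with working parallelism the wall-clock time should approach the slowest single query rather than the sum of them all:

import subprocess
import time
from multiprocessing.dummy import Pool  # thread pool; fine for subprocess calls

def ls_remote(url):
    start = time.time()
    subprocess.check_output(['git', 'ls-remote', url])
    return url, time.time() - start

# Placeholder list - in practice this would be the distinct urls from the manifests.
urls = ['https://git.mozilla.org/releases/gecko.git']

overall_start = time.time()
pool = Pool(8)
results = pool.map(ls_remote, urls)
pool.close()
pool.join()

for url, duration in results:
    print '%s: %.1fs' % (url, duration)
print 'total wall-clock time: %.1fs' % (time.time() - overall_start)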

So I'm going to try to tackle these issues together, as there are massive performance improvements that can be made, which will allow us to scale much better. And we've seen that our vcs usage is very high in general, so anything we can do that will reduce load on vcs systems can only be a good thing.
Depends on: 1097784
I moved (most of) the above concerns (comments 5,6,7) into a separate bug (1097784).
I'm migrating this host to a new VLAN and CentOS 6.5.

If I just halt the old host and start up the new, freshly puppetized one, will b2g-bumper run?
Flags: needinfo?(pmoore)
(In reply to Dustin J. Mitchell [:dustin] from comment #9)
> I'm migrating this host to a new VLAN and CentOS 6.5.
> 
> If I just halt the old host and start up the new, freshly puppetized one,
> will b2g-bumper run?

Yes. To explain my reasoning:

1) Halting the old host should do no damage; even if it dies halfway through pushing, it should not be able to leave any corruption on the git server.
2) Mozharness should take care of creating fresh working directories for b2g bumper on the first run, assuming the jobs are correctly configured in a crontab, and all prerequisite packages are installed by puppet to get mozharness on its feet.
3) The first run should not take particularly long, as there is no massive data processing to do like there is in vcs sync, so this is also not a concern.

In any case, when you start the new machine up, it would be worth monitoring that mozharness is running. If anything is not automatic, please report back in the bug.

Thanks Dustin for picking this up!
Pete
Flags: needinfo?(pmoore)
We went forward with this today since trees were closed anyway.  It's taking its sweet time cloning repositories, but nothing seems to have failed yet.
Note that I had to add an additional 65G EBS volume to this host and extend the VG into that PV -- its disk was getting dangerously full without that (and the old host had a 100GB root volume).
See Also: → 1207229
Hal, do we still need this?
Flags: needinfo?(hwine)
b2g bumper has been decommissioned
Status: NEW → RESOLVED
Closed: 8 years ago
Flags: needinfo?(hwine)
Resolution: --- → WONTFIX
Component: Tools → General