Closed Bug 990173 Opened 10 years ago Closed 8 years ago

Move b2g bumper to a dedicated host

Categories

(Release Engineering :: General, defect)

Platform: x86 Linux
Type: defect
Priority: Not set
Severity: major

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: jhopkins, Unassigned)

References

Details

Attachments

(1 file)

b2g bumper is currently running on buildbot-master66 which is being deprecated (bug 990172).

Let's move b2g bumper to a dedicated host and call it git-hg-bumpers or similar.
Hal: did this already get moved? I see a bunch of git commands running on bm66 dating from May 19, but nothing else.
Flags: needinfo?(hwine)
Not to my knowledge. Puppet still believes b2g_bumper lives on bm66.
Flags: needinfo?(hwine)
buildbot-master66 has been complaining about load all day today. I think it's time to find a beefier instance for this, or split the processes across multiple instances.
Severity: normal → major
Attached image Trend for load
The b2g bumper runs every 5 minutes, and takes about 3 minutes to run. On 10/31 it took 2:30 - 2:45 to run. Nagios alerts with WARNING when any of the short/mid/longterm load are over 10 (CRITICAL is 25). We're getting alerts because the long-term load is just about always over 10 (see inset graph). 

The longer-term trend comes from graphite; its values are quite different for the short-term load, but the long-term load looks reasonably similar. Something changed late on 11/04 and again on 11/08. I can't see any big changes in the manifests then, so I'm going to guess something external is making all the 'git ls-remote' operations we do slower. So we should either split the work or relax the nagios threshold.

I've downtimed the alert for 12 hours.
I've noticed that we are not caching the results of git ls-remote across branches - which means we are creating far more load than we need to. I would propose resolving this first, as with 5 branches we might be able to reduce load by e.g. 70%-80% (depending on how much commonality exists between the manifests).

I will take a look at this.
I created a very crude script to run the branches in sequence, and ran it locally.

The first branch took approximately 20 mins to run. The other branches took approximately 12 seconds to run, when sharing a cache across branches. In other words, it looks like with 5 branches, when sharing a cache, we should hit about 20% of the resource usage we were previously hitting.

The very crude script for testing was:

#!/usr/bin/env python
# Very crude test harness: run B2GBumper once per config file, re-using the
# same in-memory git ref cache so only the first branch pays the full cost
# of the git ls-remote calls.

import sys
import copy
from b2g_bumper import B2GBumper

if __name__ == '__main__':

    git_ref_cache = {}
    # This wrapper is named <script>_combined.py; strip the suffix to recover
    # the real b2g_bumper script name for sys.argv[0].
    b2g_bumper_script = sys.argv[0].replace('_combined', '')
    config_files = copy.copy(sys.argv[1:])
    for config in config_files:
        # Fake the command line that mozharness expects for a single branch.
        sys.argv = [b2g_bumper_script, '-c', config, '-c', '/Users/pmoore/work/b2g_bumper/pete.py', '--checkout-manifests', '--massage-manifests']
        bumper = B2GBumper()
        # Seed this run with the cache accumulated by the previous branches.
        bumper._git_ref_cache = git_ref_cache
        print bumper._git_ref_cache
        bumper.run()
        print bumper._git_ref_cache

This essentially cached the bumper._git_ref_cache between runs.

I'll implement this differently for this bug, since the approach above messed up mozharness logging. Instead I'll write out the cache as a json file, and have a script which runs through the branches in sequence, loading the cache at the beginning of each run and storing it at the end.

Each time the cron job runs, it will start with an empty cache, so the cache is only shared across branches within a single cron run; the next run starts with a fresh cache.
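Something like the wrapper below is what I have in mind - a rough sketch only; the script path, cache file location, and the option for passing the cache file are placeholders, not existing mozharness options:

#!/usr/bin/env python
# Rough sketch: run the bumper once per branch config, sharing a git ref
# cache between branches via a throwaway json file.
import os
import subprocess
import sys

CACHE_FILE = '/tmp/b2g_bumper_git_ref_cache.json'  # placeholder path

if __name__ == '__main__':
    # Start each cron invocation with an empty cache, so refs are only
    # shared across branches within a single run, never across runs.
    if os.path.exists(CACHE_FILE):
        os.remove(CACHE_FILE)
    for config in sys.argv[1:]:
        # The bumper would load the cache file (if present) before querying
        # git, and write it back out when it finishes.
        subprocess.check_call([
            'python', 'scripts/b2g_bumper.py',  # placeholder script path
            '-c', config,
            '--git-ref-cache', CACHE_FILE,      # hypothetical option name
        ])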

Hope this makes sense!
I've written the code to clear/import/export the cache and tested it locally, and it is working. However, I believe the parallelisation is not working properly in b2g bumper, so I want to fix this too.

I've also noticed that we query for a single revision when using git ls-remote, but of the 1017 distinct (url, head/tag) requests we make, there are only 302 unique urls - i.e. on average each git url is queried 3.36 times for different tags/heads. I believe it will be much more efficient to query each url only once and get back all of its heads/tags.
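For illustration, here's roughly what querying all refs in one go and answering later lookups from memory looks like (the function names are mine, not the existing bumper code):

import subprocess

def all_refs(url):
    """Run a single 'git ls-remote <url>' and return {ref_name: sha}."""
    output = subprocess.check_output(['git', 'ls-remote', url])
    refs = {}
    for line in output.splitlines():
        sha, _, name = line.partition('\t')
        refs[name] = sha
    return refs

# Populate the cache once per url, then resolve the ~3.36 lookups per url
# from memory instead of issuing a separate ls-remote per head/tag.
cache = {}

def lookup(url, ref):
    if url not in cache:
        cache[url] = all_refs(url)
    return cache[url].get('refs/heads/%s' % ref) or cache[url].get('refs/tags/%s' % ref)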

Another point that occurs to me: all this data we collect and hammer git.m.o for, we already have in vcs sync. We know the state of all the heads/tags because only vcs sync pushes to git.m.o - so if we simply stored the state of the heads/tags in vcs sync when pushing, we would not need to keep querying git.m.o. At the moment vcs sync polls the source repos looking for changes and only pushes to git.m.o when there are changes, but with b2g bumper we are now also polling git.m.o looking for changes.
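As a rough illustration of that idea (the state file path and function are hypothetical, not existing vcs sync code), the push step could snapshot the refs it has just pushed for the bumper to read, instead of the bumper running ls-remote against git.m.o:

import json
import os
import subprocess

STATE_FILE = '/builds/vcs_sync/pushed_refs.json'  # hypothetical location

def record_pushed_refs(local_repo, remote_url):
    # After a successful push the local mirror's refs match git.m.o, so a
    # local 'git show-ref' tells us the remote state with no network I/O.
    output = subprocess.check_output(['git', 'show-ref'], cwd=local_repo)
    refs = {}
    for line in output.splitlines():
        sha, name = line.split(' ', 1)
        refs[name] = sha
    state = {}
    if os.path.exists(STATE_FILE):
        state = json.load(open(STATE_FILE))
    state[remote_url] = refs
    with open(STATE_FILE, 'w') as f:
        json.dump(state, f)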

Lastly, I don't think the parallelisation is working as expected: a b2g bumper run averages a couple of seconds per git ls-remote command when parallelised, and it takes just as long without parallelisation.
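As a sanity check, a standalone test of running ls-remote through a worker pool (nothing bumper-specific, just the pattern I'd expect, with a placeholder url list) is below - with working parallelism the wall-clock time should approach the slowest single query rather than the sum of them all:

import subprocess
import time
from multiprocessing.dummy import Pool  # thread pool; fine for subprocess calls

def ls_remote(url):
    start = time.time()
    subprocess.check_output(['git', 'ls-remote', url])
    return url, time.time() - start

# Placeholder list - in practice this would be the distinct urls from the manifests.
urls = ['https://git.mozilla.org/releases/gecko.git']

overall_start = time.time()
pool = Pool(8)
results = pool.map(ls_remote, urls)
pool.close()
pool.join()

for url, duration in results:
    print '%s: %.1fs' % (url, duration)
print 'total wall-clock time: %.1fs' % (time.time() - overall_start)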

So I'm going to try to tackle these issues together, as there are massive performance improvements that can be made, which will allow us to scale much better. And we've seen that our vcs usage is very high in general, so anything we can do that will reduce load on vcs systems can only be a good thing.
Depends on: 1097784
I moved (most of) the above concerns (comments 5,6,7) into a separate bug (1097784).
I'm migrating this host to a new VLAN and CentOS 6.5.

If I just halt the old host and start up the new, freshly puppetized one, will b2g-bumper run?
Flags: needinfo?(pmoore)
(In reply to Dustin J. Mitchell [:dustin] from comment #9)
> I'm migrating this host to a new VLAN and CentOS 6.5.
> 
> If I just halt the old host and start up the new, freshly puppetized one,
> will b2g-bumper run?

Yes. To explain my reasoning:

1) Halting the old host should do no damage; even if it dies halfway through pushing, it should not be able to leave any corruption on the git server.
2) Mozharness should take care of creating fresh working directories for b2g bumper on the first run, assuming the jobs are correctly configured in a crontab, and all prerequisite packages are installed by puppet to get mozharness on its feet.
3) The first run should not take particularly long, as there is no massive data processing to do like there is in vcs sync, so this is also not a concern.

In any case, when you start the new machine up, it would be worth monitoring that mozharness is running. If anything is not automatic, please report back in the bug.

Thanks Dustin for picking this up!
Pete
Flags: needinfo?(pmoore)
We went forward with this today since trees were closed anyway.  It's taking its sweet time cloning repositories, but nothing seems to have failed yet.
Note that I had to add an additional 65G EBS volume to this host and extend the VG into that PV -- its disk was getting dangerously full without that (and the old host had a 100GB root volume).
See Also: → 1207229
Hal, do we still need this?
Flags: needinfo?(hwine)
b2g bumper has been decommissioned
Status: NEW → RESOLVED
Closed: 8 years ago
Flags: needinfo?(hwine)
Resolution: --- → WONTFIX
Component: Tools → General