Closed Bug 1154795 Opened 9 years ago Closed 8 years ago

add an endpoint to relengapi that creates and returns a Mozharness archive via S3

Categories

(Release Engineering :: General, defect)

x86
macOS
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jlund, Assigned: jlund)

References

Details

Attachments

(1 file, 1 obsolete file)

this is based off the skeleton[1] relengapi repo following instructions here[2]

what this does:

# if there is already a mozharness archive in s3 for a given
# $BRANCH and $REV, redirect (302) to the s3 link
# (you can pass a query arg to request a specific region)
> curl -i http://127.0.0.1:8010/mozharness_archiver/mozilla-central/9eae3880b132
HTTP/1.0 302 FOUND
Content-Type: text/html; charset=utf-8
Content-Length: 613
Location: https://mozharness-archiver-basic-us-west-2.s3-us-west-2.amazonaws.com/mozilla-central-9eae3880b132?Signature=DI4GK5okD5bYlgABAxFJlLrt6KY%3D&Expires=1433284255&AWSAccessKeyId=AKIAIYHUTJ7BG2GMUTXA
Server: Werkzeug/0.10.4 Python/2.7.6
Date: Tue, 02 Jun 2015 22:25:55 GMT

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<title>Redirecting...</title>
<h1>Redirecting...</h1>
<p>You should be redirected automatically to target URL: <a href="https://mozharness-archiver-basic-us-west-2.s3-us-west-2.amazonaws.com/mozilla-central-9eae3880b132?Signature=DI4GK5okD5bYlgABAxFJlLrt6KY%3D&amp;Expires=1433284255&amp;AWSAccessKeyId=AKIAIYHUTJ7BG2GMUTXA">https://mozharness-archiver-basic-us-west-2.s3-
us-west-2.amazonaws.com/mozilla-central-9eae3880b132?Signature=DI4GK5okD5bYlgABAxFJlLrt6KY%3D&amp;Expires=1433284255&amp;AWSAccessKeyId=AKIAIYHUTJ7BG2GMUTXA</a>.  If not click the link.%






# if it is not in s3 yet, the request is considered long-running, so return
# an Accepted (202) response with a Location url for the task that was kicked off.
> curl -i http://127.0.0.1:8010/mozharness_archiver/mozilla-central/9eae3880b132
HTTP/1.0 202 ACCEPTED
Content-Type: application/json
Content-Length: 18
Location: http://127.0.0.1:8010/mozharness_archiver/status/9eae3880b132
Server: Werkzeug/0.10.4 Python/2.7.6
Date: Tue, 02 Jun 2015 22:19:43 GMT

{
  "result": {}
}%


# then, when you look up that task, the response provides the state (e.g. PENDING,
# SUCCESS, FAILURE) and the s3 urls where relengapi uploaded the mh archive
> curl -i http://127.0.0.1:8010/mozharness_archiver/status/9eae3880b132
HTTP/1.0 200 OK
Content-Type: application/json
Content-Length: 663
Server: Werkzeug/0.10.4 Python/2.7.6
Date: Tue, 02 Jun 2015 22:20:01 GMT

{
  "result": {
    "hgmo_url": "http://hg.mozilla.org/mozilla-central/archive/9eae3880b132.tar.gz/testing/mozharness",
    "state": "SUCCESS",
    "status": "Task complete! See usw2_s3_url and use1_s3_url results for archive locations.",
    "use1_s3_url": "https://mozharness-archiver-basic-us-east-1.s3.amazonaws.com/%7Bbranch%7D-%7Brev%7D?Signature=2Y2Rv%2Bz6YBXu5oYO02dwCKdE0cU%3D&Expires=1433283647&AWSAccessKeyId=AKIAIYHUTJ7BG2GMUTXA",
    "usw2_s3_url": "https://mozharness-archiver-basic-us-west-2.s3-us-west-2.amazonaws.com/%7Bbranch%7D-%7Brev%7D?Signature=4%2Fc7Ayc%2Fliyes24yviggW9U1NRE%3D&Expires=1433283646&AWSAccessKeyId=AKIAIYHUTJ7BG2GMUTXA"
  }
}%
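The three transcripts above boil down to one decision per request. Here's a stand-in sketch of that decision logic in plain Python (no Flask, boto, or Celery; the function and argument names are hypothetical, not the actual relengapi code):

```python
def archiver_response(key_in_s3, task_state, signed_url=None, status_url=None):
    """Mirror the endpoint's three behaviours: 302 when the archive is
    already in S3, 202 while the Celery task runs, otherwise report the
    task state (as the /status/ endpoint does)."""
    if key_in_s3:
        # archive already uploaded: redirect to the signed S3 URL
        return 302, {'Location': signed_url}
    if task_state in ('PENDING', 'STARTED'):
        # long-running work kicked off: point the client at the status URL
        return 202, {'Location': status_url}
    return 200, {'state': task_state}

# the first curl above hits the 302 branch; the second hits the 202 branch
assert archiver_response(True, None, signed_url='https://s3/...')[0] == 302
assert archiver_response(False, 'PENDING', status_url='/status/rev')[0] == 202
assert archiver_response(False, 'SUCCESS')[0] == 200
```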


dustin, I'm more concerned with the high-level approach at this point. I don't want to tackle the list of TODOs below unless you think the REST endpoints above match your expectations. If it looks good, I'll proceed and we can tackle my hacky code later ;)

TODO

1) tests!
2) we need a badpenny cron task to clean up the archives that we save to disk locally (from hgmo)
3) test out the retry method for downloading the archive
4) docstrings!


[1] https://github.com/mozilla/build-relengapi-skeleton.git
[2] https://api.pub.build.mozilla.org/docs/development/@relengapi/blueprints/
Attachment #8614768 - Flags: feedback?(dustin)
I should note that this patch and the repo[1] it's based on are not applicable if we end up combining the relengapi repos into one repo. But, again, I'm focused on functionality for now and can tackle that later

[1] https://github.com/lundjordan/mozharness-archiver
Comment on attachment 8614768 [details] [diff] [review]
150603_1154795_relengapi_mh_archiver.patch

It should be straightforward to merge this into the single repo, either with git merge (if you want to keep history) or just copying the source.

I think my major large-scale concern is how much overlap exists between this tool and gps's bundleclone work -- I see that these are tarballs whereas bundleclone is putting bundles on S3.  But aside from that, these two designs seem very similar!  Is the idea that the tarball is just a single revision, and thus smaller than the bundle?  Could we add this functionality directly to bundleclone?

I love the use of Celery, and particularly fetching the task status of the 202 directly from Celery.

The use of signed URLs is nice, even though this is public data, since it will discourage users from linking directly to files on S3.

I'm also a bit confused by the name.  This is called "Mozharness Archiver" which, along with the title of this bug, suggests it would archive snapshots of the mozharness repo.  Yet the examples show it returning snapshots of mozilla-central, and it seems like it could handle any hg repository.  Is it really that general?  Maybe renaming "branch" to "repo" would help clarify.

What will the access pattern look like?  If dozens of builds all request a new archive in a short amount of time, we'll kick off a bunch of download-and-upload operations in parallel.  That won't cause errors, but might mean this tool puts just as much load on hg.m.o as accessing the archives directly.

Some small bits:

 * region names should not be hard-coded; they should be in the configuration
 * similarly for bucket names
 * the blueprint's files should be in relengapi/blueprints/mozharness_archiver/
 * relengapi already requires 'redo', which supplies a retry method, so there's no need to copy/paste that code
   * but in this case it is probably better to use Celery's retry functionality
 * you could parallelize the S3 uploads, and get retry functionality easily, by using celery subtasks
Attachment #8614768 - Flags: feedback?(dustin) → feedback+
thanks for the thoughtful feedback!

> I think my major large-scale concern is how much overlap exists between this
> tool and gps's bundleclone work -- I see that these are tarballs whereas
> bundleclone is putting bundles on S3.  But aside from that, these two
> designs seem very similar!  Is the idea that the tarball is just a single
> revision, and thus smaller than the bundle?  Could we add this functionality
> directly to bundleclone?

I may be parsing wrong but my impression of bundleclone[1] is that it still requires you to have a full clone of a repo. This is taking just a subdir archive of a repo. More on that below.

[1] http://mozilla-version-control-tools.readthedocs.org/en/latest/hgmo/bundleclone.html


> I'm also a bit confused by the name.  This is called "Mozharness Archiver"
> which, along with the title of this bug, suggests it would archive snapshots
> of the mozharness repo.  Yet the examples show it returning snapshots of
> mozilla-central, and it seems like it could handle any hg repository.  Is it
> really that general?  Maybe renaming "branch" to "repo" would help clarify.

we are actually taking a subdir archive from a vcs repo. So it's not a full or shallow clone (those contain a full checkout plus .hg history), nor a tarball of the whole repo (a full checkout without hg history). It's a tarball of just part of the repo: in this case, the root mozharness dir and its contents from in-tree. The result is that the downloads and the archives themselves are minimal. Kilobytes!

I see what you mean about how this seems more general than mozharness. Maybe I should make things more configurable and generic so that we could take this logic and apply it elsewhere (maybe to build/tools as well, if we put tools in-tree too).

> 
> What will the access pattern look like?  If dozens of builds all request a
> new archive in a short amount of time, we'll kick off a bunch of
> download-and-upload operations in parallel.  That won't cause errors, but
> might mean this tool puts just as much load on hg.m.o as accessing the
> archives directly.

I think this is an oversight on my part. I am using apply_async() (rather than delay()) so that I can manually assign the task ID. What I should be doing is checking whether a task with the same ID already exists; if it does and is incomplete, return a 302 with the location of that task. That way, we never kick off more than one copy of the same task
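The dedup idea can be sketched like so (plain Python standing in for Celery's result backend; the dict and function names are hypothetical, not actual relengapi or Celery code):

```python
# Stand-in for Celery's result backend: task_id -> state.
task_states = {}

def submit_archive_task(branch, rev, kick_off):
    """Start the download-and-upload task only if an identical one is not
    already in flight; otherwise hand back the existing task's id so the
    view can 302 to its status URL."""
    task_id = '%s-%s' % (branch, rev)  # deterministic id, as with apply_async(task_id=...)
    if task_states.get(task_id) in ('PENDING', 'STARTED'):
        return task_id, False  # already running: don't start a duplicate
    task_states[task_id] = 'PENDING'
    kick_off(task_id)
    return task_id, True

started = []
submit_archive_task('mozilla-central', '9eae3880b132', started.append)
submit_archive_task('mozilla-central', '9eae3880b132', started.append)
assert len(started) == 1  # the second request reused the in-flight task
```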

> 
> Some small bits:
> 
>  * region names should not be hard-coded; they should be in the configuration
>  * similarly for bucket names

will-fix

>  * the blueprint's files should be in
> relengapi/blueprints/mozharness_archiver/

will-fix

>  * relengapi already requires 'redo', which supplies a retry method, so
> there's no need to copy/paste that code

I see that slaveloan's setup.py requires 'redo', but build-relengapi's and my skeleton repo's setup.py don't include 'redo'. Am I grokking wrong?

>    * but in this case it is probably better to use Celery's retry
> functionality

okay, I'll use http://celery.readthedocs.org/en/latest/reference/celery.app.task.html?highlight=retry#celery.app.task.Task.retry
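For reference, the retry pattern that both redo and Celery provide boils down to something like this (a generic in-process stand-in in plain Python; Celery's Task.retry additionally re-queues the task through the broker rather than looping like this):

```python
import time

def retry(fn, attempts=5, sleeptime=0):
    """Minimal stand-in for the retry helpers in redo/Celery: call fn,
    retrying on any exception up to `attempts` times."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: let the failure propagate
            time.sleep(sleeptime)

calls = []
def flaky_download():
    # fails twice, then succeeds, like a transient hgmo hiccup
    calls.append(1)
    if len(calls) < 3:
        raise IOError('transient hgmo hiccup')
    return 'archive-bytes'

assert retry(flaky_download) == 'archive-bytes'
assert len(calls) == 3  # two failures, then success
```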


>  * you could parallelize the S3 uploads, and get retry functionality easily,
> by using celery subtasks

I'll investigate subtasks.

thanks again!
another thing I neglected to think about:

we often have repo url prefixes for repos:

e.g. projects/ash or releases/mozilla-beta

I'll need to account for things like 'integration', 'projects', and 'releases' somehow.

dustin, would you prefer:

1) I have a manifest/dict in my config/settings.py that takes a 'branch' and returns a repo url location
** e.g. {'ash': 'https://hg.mozilla.org/projects/ash'}

2) or I add a query param to the GET url that accepts a prefix
** e.g. curl -i http://127.0.0.1:8010/mozharness_archiver/ash/9eae3880b132?prefix=projects&region=us-east-1

option 1 could be more fragile, as we would have to modify the relengapi manifest each time we add a new repo; option 2 requires more work/knowledge from the client.
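To make the two options concrete, here's a plain-Python sketch (the manifest contents and function names are hypothetical, not actual relengapi config):

```python
# Option 1: a manifest in settings.py mapping 'branch' -> full repo URL.
BRANCH_MANIFEST = {'ash': 'https://hg.mozilla.org/projects/ash'}

def repo_url_from_manifest(branch):
    return BRANCH_MANIFEST[branch]  # KeyError for branches we forgot to add

# Option 2: the client supplies the prefix as a query arg; we just join it on.
def repo_url_from_prefix(branch, prefix=None):
    path = '%s/%s' % (prefix, branch) if prefix else branch
    return 'https://hg.mozilla.org/' + path

assert repo_url_from_manifest('ash') == 'https://hg.mozilla.org/projects/ash'
assert repo_url_from_prefix('ash', prefix='projects') == 'https://hg.mozilla.org/projects/ash'
assert repo_url_from_prefix('mozilla-central') == 'https://hg.mozilla.org/mozilla-central'
```

The trade-off is visible here: option 1 raises KeyError for any repo not yet added to the manifest, while option 2 trusts the client to know the right prefix.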
Ah, I had no idea that this was for subdirectories.  That must be what the "testing/mozharness" bit in HGMO_URL_TEMPLATE means.  I can't quite come up with a pithy short name, but it seems like just a little generalization could make this a generic tool for getting tarballs of subdirectories of Mozilla hg repositories at specific revisions, minimizing load on hg.  It feels like that should be in dev-services' realm of responsibilities long-term, but makes sense here short-term.

`redo` is in relengapi's requirements now [1] (as of this morning)

As for repos with prefixes, it might be best to either take the whole repo string as a query arg, or require that the slashes be quoted in the URL parameter ('integration%2fmozilla-inbound').  I don't think it makes sense to separate the prefix and the repo name within the request URL.  One other option is to just accept the full artifact path for hgmo, do some basic validation (/.*\/archive\/[0-9a-f]{12}\.tar\.gz\/.*/), and then pass it on.
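The basic validation suggested above is easy to sketch (the helper name is hypothetical; the regex is the one quoted in the comment):

```python
import re

# sanity check for a full hgmo artifact path: .../archive/<12-hex-rev>.tar.gz/<subdir>
ARTIFACT_RE = re.compile(r'.*/archive/[0-9a-f]{12}\.tar\.gz/.*')

def valid_artifact_path(path):
    return ARTIFACT_RE.match(path) is not None

assert valid_artifact_path(
    'mozilla-central/archive/9eae3880b132.tar.gz/testing/mozharness')
# 'tip' isn't a 12-char hex revision, so this gets rejected
assert not valid_artifact_path('mozilla-central/archive/tip.tar.gz/testing')
```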

In fact, that feels like something that proxxy could already do with just a little extra configuration.  I guess my gut is still saying that this service shouldn't be necessary.

I noticed a few other minor things while having another look:

 * use https://hg.mozilla.org -- all communication should be encrypted and certificates validated
 * downloading the archive to disk is not great (and the path you've chosen won't work in production), since it has the potential to create disk-space issues on the celery nodes.  If you can manage to stream that data down from hg and straight up to S3, block-by-block, that would be best.

[1] https://github.com/mozilla/build-relengapi/blob/master/setup.py#L36
> It feels like that should be in dev-services' realm of responsibilities
> long-term, but makes sense here short-term.

this was something I discussed with dev-services. Originally we were going to try to hammer hgmo for every request but I thought I could take some of their burden by creating an API that tries to be more efficient. Plus I got to play with relengapi finally!

I think TaskCluster is saving each repo on the slave, so this bug won't be useful to them and will thus only act as a stopgap for our current buildbot + slave setup.

I want to just get this done so I can unblock TC and get mh in-tree as fast as possible without much code change in our own infra.


> As for repos with prefixes

I'll try full repo path (without the rev) as a query_arg. It should also be a quick easy solution.

> 
> In fact, that feels like something that proxxy could already do with just a
> little extra configuration.  I guess my gut is still saying that this
> service shouldn't be necessary..

Maybe. I could look into how proxxy works, but I feel like we are pretty close with this relengapi effort. I've only used it via the proxxy client (which is unfortunately in mozharness ;), so I worry there would be too much time required and not enough reward for the effort of switching. See above for my arguments in support of this relengapi endpoint.
 
> 
> I noticed a few other minor things while having another look:
> 
>  * use https://hg.mozilla.org -- all communication should be encrypted and
> certificates validated

will-fix

>  * downloading the archive to disk is not great (and the path you've chosen
> won't work in production), since it has the potential to create disk-space
> issues on the celery nodes.  If you can manage to stream that data down from
> hg and straight up to S3, block-by-block, that would be best.

that was the reason for my TODO in comment 1: "2) we need a badpenny cron task to clean up the archives that we save to disk locally (from hgmo)"

I wanted to upload to s3 by url, but that's not supported yet. What I could try is to store the file in memory and trick boto via: set_contents_from_file(in_memory_fp)
OK, I can breathe easier knowing this is temporary.  If this is later supported by dev-services directly, or via proxxy, then it should be a small change on the client.

I was thinking of using key.set_contents_from_stream(resp):

    resp = urllib2.urlopen(url)
    # verify status code, content-length, etc.
    key.set_contents_from_stream(resp)
> I was thinking of using key.set_contents_from_stream(resp):
> 
>     resp = urllib2.urlopen(url)
>     # verify status code, content-length, etc.
>     key.set_contents_from_stream(resp)

there is something more elegant with your approach than what I was suggesting:
    k.set_contents_from_file(StringIO.StringIO(urllib2.urlopen(url).read()))

;) thanks for pointing me to set_contents_from_stream
Not loading the whole file into memory will help avoid MemoryErrors or swapping, too :)
(In reply to Dustin J. Mitchell [:dustin] from comment #10)
> Not loading the whole file into memory will help avoid MemoryErrors or
> swapping, too :)

so I've been battling with this most of the morning. I'm trying to grab an hgmo archive with either requests or urllib2, and both always seem to give me transfer-encoding: chunked. That's bad, as s3 + boto do not accept chunked transfers for set_contents_from_stream.

I tried playing around with explicitly adding the content-length but no such luck:


example of one of my many stabs in the dark:


In [27]: key.set_contents_from_stream(resp2, headers={'Content-Length': resp2.headers['Content-Length']})
---------------------------------------------------------------------------
BotoClientError                           Traceback (most recent call last)
<ipython-input-27-42fe5c0dcdd4> in <module>()
----> 1 key.set_contents_from_stream(resp2, headers={'Content-Length': resp2.headers['Content-Length']})

/Users/jlund/.virtualenvs/subrepo-archives/lib/python2.7/site-packages/boto/s3/key.pyc in set_contents_from_stream(self, fp, headers, replace, cb, num_cb, policy, reduced_redundancy, query_args, size)
   1093         if not provider.supports_chunked_transfer():
   1094             raise BotoClientError('%s does not support chunked transfer'
-> 1095                 % provider.get_provider_name())
   1096
   1097         # Name of the Object should be specified explicitly for Streams.

BotoClientError: BotoClientError: s3 does not support chunked transfer


dustin: Based on how much time I've spent already, I would like to go back to either storing this in memory or in a file. Which of those would you prefer? I could add a cleanup() method that deletes the local file once we're done with it?
Indeed, it has nothing to do with the transfer-encoding of the source; *S3* doesn't support chunked transfer-encoding, which means that we can't upload from a stream without knowing the size in advance.  That's kind of lame, since S3 does support multipart uploads, but c'est la vie.

I think on-disk is best.  If you use tempfile.TemporaryFile, the file is automatically deleted for you and its space is freed when it is closed, so you don't need to worry about a cleanup task.

Here's some code that I whipped up (and ran successfully!):

import os
import requests
import tempfile
import boto
import shutil


url = "https://hg.mozilla.org/mozilla-central/archive/tip.tar.gz/testing"

# make the source request
resp = requests.get(url, stream=True)

# create a temporary file for it
dest = tempfile.TemporaryFile()

# copy the data, block-by-block, into that file
shutil.copyfileobj(resp.raw, dest, 10240)
resp.close()

print "got {} bytes".format(dest.tell())

# write it out to S3
s3 = boto.connect_s3()
bucket = s3.get_bucket('mozilla-releng-copy-test-use1')
key = bucket.new_key('mozharness.tar.gz')
size = dest.tell()  # bytes copied above; S3 needs the size up front
dest.seek(0, os.SEEK_SET)  # rewind to start of the file
key.set_contents_from_file(dest, size=size)
Heh, I forgot about the first paragraph of that comment.  Honestly, set_contents_from_file can *almost* do this without the size, but not quite: it tries to rewind the file to calculate the md5 hash.  So boto's a bit lame here.
round 2, fight!

### from PR:

this consolidates my relengapi repo into the now unified build-relengapi and addresses feedback comments regarding:
- hardcoded repo and subdir: now supports any hg repo, branch, and subdir configuration
- hardcoded regions: regions are in the cfg
- blueprints no longer copy the skeleton repo's style of blueprints/name_of_blueprint.py and instead use blueprints/name_of_blueprint/{__init__.py, etc}
- my settings.py cfg uses https not http
- I ended up downloading the src archive to disk but in a tempfile

from feedback that is not implemented:
- I did not end up using celery subtasks, as it felt like overkill for the situation
- I also did not use celery's retry: in the event of a FAILURE, the current logic re-creates the download-and-upload task anyway whenever the requested s3 key does not exist, so supporting retry didn't seem that important

if this approach works for you, I'll add tests and docs (rst file)
Attachment #8614768 - Attachment is obsolete: true
Attachment #8617639 - Flags: review?(dustin)
Comment on attachment 8617639 [details] [review]
adds archiver endpoint to consolidated build-relengapi repo

re-setting review for now. need to add tests and docs first for validate.sh to be happy
Attachment #8617639 - Flags: review?(dustin)
Comment on attachment 8617639 [details] [review]
adds archiver endpoint to consolidated build-relengapi repo

PR is back open :)
Attachment #8617639 - Flags: review?(dustin)
Comment on attachment 8617639 [details] [review]
adds archiver endpoint to consolidated build-relengapi repo

PR open again. All issues addressed. I squashed the commits into a single commit so the interdiff is clear
Attachment #8617639 - Flags: review?(dustin)
Attachment #8617639 - Flags: review?(dustin) → review+
Depends on: 1178471
deployed onto staging/prod relengapi. thank you for bearing with me!

tested on my staging master:

examples (need vpn)
  - retrying and timing out when relengapi is unavailable or the endpoint doesn't exist: http://dev-master2.bb.releng.use1.mozilla.com:8037/builders/OS%20X%2010.7%20mozilla-central%20build/builds/24

  - retrying and timing out when the in-tree mozharness repo + rev doesn't exist: http://dev-master2.bb.releng.use1.mozilla.com:8037/builders/OS%20X%2010.7%20mozilla-central%20build/builds/25

  - successfully using archiver endpoint on a real repo + rev: http://dev-master2.bb.releng.use1.mozilla.com:8037/builders/OS%20X%2010.7%20mozilla-central%20build/builds/26
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Status: RESOLVED → REOPENED
Depends on: 1182532
Resolution: FIXED → ---
Status: REOPENED → RESOLVED
Closed: 9 years ago8 years ago
Resolution: --- → FIXED
Component: Tools → General