Closed Bug 1347956 Opened 7 years ago Closed 6 years ago

Some public-artifacts.taskcluster.net files are not served gzipped

Categories

(Taskcluster :: Services, enhancement)

Type: enhancement
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: emorley, Unassigned)

References

Details

Compare:

curl -IL --compressed "https://queue.taskcluster.net/v1/task/dTgEO-dOQ4KOnc-ZJ6FLLQ/runs/0/artifacts/public/logs/live_backing.log"

curl -IL --compressed "https://queue.taskcluster.net/v1/task/EnHkkQP8RN-7mJf9ZHo4Ug/runs/0/artifacts/public/test_info//reftest-no-accel_errorsummary.log"

The former's response (after the HTTP 303) has `Content-Encoding: gzip` whereas the latter does not:

$ curl -IL --compressed "https://queue.taskcluster.net/v1/task/EnHkkQP8RN-7mJf9ZHo4Ug/runs/0/artifacts/public/test_info//reftest-no-accel_errorsummary.log"
HTTP/1.1 303 See Other
...
Location: https://public-artifacts.taskcluster.net/EnHkkQP8RN-7mJf9ZHo4Ug/0/public/test_info//reftest-no-accel_errorsummary.log

HTTP/1.1 200 OK
Content-Type: text/plain
Content-Length: 3457715
Connection: keep-alive
Date: Thu, 16 Mar 2017 14:50:42 GMT
Last-Modified: Thu, 16 Mar 2017 00:39:28 GMT
ETag: "2fb988221e25255927d1964f1d728ece"
x-amz-version-id: f2EfsMJdWE56i2PLwJrUhZDkhV8s0f8f
Accept-Ranges: bytes
Server: AmazonS3
Age: 1467
X-Cache: Hit from cloudfront
Via: 1.1 d2458a4f6c40dc75a7c4afbe573a1387.cloudfront.net (CloudFront)
X-Amz-Cf-Id: 3QbcTOOnG8RzGsktdrVXJmwGXeEN4VElEUii54oZV9CGJO75Qvf1fw==


This file is uploaded via the `MOZ_UPLOAD_DIR` mechanism, so possibly by this script?
https://dxr.mozilla.org/mozilla-central/source/testing/mozharness/mozharness/mozilla/blob_upload.py

Whilst I know that in some places we compress before upload to S3, CloudFront can automatically gzip too:
https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/ServingCompressedFiles.html
https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/ServingCompressedFiles.html#compressed-content-cloudfront-file-types
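
For illustration, a minimal sketch of the compress-before-upload alternative (boto3 and the bucket/key parameters here are assumptions for the example; the workers don't necessarily upload this way):

import gzip
import boto3

def upload_gzipped(s3, bucket, key, path, content_type="text/plain"):
    # Compress the body once at upload time and record Content-Encoding: gzip,
    # so S3/CloudFront serve it compressed and clients decode transparently.
    with open(path, "rb") as f:
        body = gzip.compress(f.read())
    s3.put_object(Bucket=bucket, Key=key, Body=body,
                  ContentType=content_type, ContentEncoding="gzip")

upload_gzipped(boto3.client("s3"), "public-artifacts-example", "some/key.log",
               "reftest-no-accel_errorsummary.log")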

Greg, I don't suppose you could redirect to the best person to look at this? :-)
(It would help with the slow transfer performance in eg bug 1347945)
Flags: needinfo?(garndt)
So the links to live_backing.log are compressed; the workers do that because that's a file the worker maintains, so they know to upload it compressed.

The worker makes no assumption about the compression to use when uploading files that a task creates.  We could probably just assume that all text/plain artifacts should be gzipped.

Jonas, thoughts?
Flags: needinfo?(garndt) → needinfo?(jopsen)
We should look into this at some point.

But compression is a storage vs. CPU-cycle trade-off.
Flags: needinfo?(jopsen)
(In reply to Jonas Finnemann Jensen (:jonasfj) from comment #2)
> But compression is a storage vs. cpu cycle cost.

If we were talking about archival cost only, yes.

But there's the network transfer for the multiple consumers of the file to consider.
(In reply to Greg Arndt [:garndt] from comment #1)

> The worker makes no assumption about the compression to use when uploading
> files that a task creates.  We could probably just assume that all
> plain/text artifact should be gzipped.

What about all files, regardless of mime type?
Blocks: 1295997
I had a chat with Ed on IRC here: http://logs.glob.uno/?c=ateam#c1107025

09:37	pmoore	emorley: jonasfj: do you see any reason why we shouldn't use gzip content encoding for *all* artifacts (not just text/plain ones)?
09:39	pmoore	.... we could have a sanity check that if the compressed version is larger than the uncompressed version, we don't - but maybe that never even happens
09:41	pmoore	.... i also can't imagine that compression / decompression costs a lot in wall-time - and if it does, probably it is because of enormous artifacts which would then be served more quickly as a result ...
09:41	emorley	pmoore: for a point of reference, whitenoise takes the "blacklist rather than whitelist" and "check compression ratio <95%" approach github.com/evansd/whitenoise/blob/…88a/whitenoise/compress.py#L21-L29
09:42	emorley	and logs cases where files were skipped due to compression ratio not being good enough
09:42	emorley	(where those logs then allow for figuring out which cases to add to the blacklist)
09:42	pmoore	ah, nice
09:42	pmoore	that seems like a sane approach to me
10:03	pmoore	emorley: in the absence of any objections, i'll implement like this in generic-worker - the same blacklist, and the same <95% compression ratio requirement
10:04	emorley	sgtm
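
For reference, a minimal Python sketch of that approach (the extension list and helper name are hypothetical, loosely modelled on whitenoise; generic-worker itself is written in Go):

import gzip

# Extensions that are already compressed, so gzipping them again is pointless.
SKIP_EXTENSIONS = (".gz", ".zip", ".png", ".jpg", ".jpeg", ".gif", ".woff", ".woff2")

def maybe_gzip(path, min_saving=0.05):
    """Return gzipped bytes if compression is worthwhile, else None."""
    if path.lower().endswith(SKIP_EXTENSIONS):
        return None
    with open(path, "rb") as f:
        raw = f.read()
    compressed = gzip.compress(raw)
    # Only use the compressed copy if it is <95% of the original size.
    if len(compressed) >= len(raw) * (1 - min_saving):
        return None
    return compressed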
It's worth noting that some clients (curl, notably) do not automatically decode HTTP responses, so this may break a lot of things.
urllib2 being another example
Good point, we'll need to add `--compressed` to the relevant curl invocations.

(In reply to Dustin J. Mitchell [:dustin] from comment #7)
> urllib2 being another example

Another reason for switching to requests :-)
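
For illustration, a minimal sketch of what that buys (requests sends Accept-Encoding: gzip and decodes the response transparently; the URL is the gzipped example from comment 0):

import requests

url = ("https://queue.taskcluster.net/v1/task/dTgEO-dOQ4KOnc-ZJ6FLLQ"
       "/runs/0/artifacts/public/logs/live_backing.log")
resp = requests.get(url)  # follows the 303 and gunzips automatically
resp.raise_for_status()
print(resp.headers.get("Content-Encoding"), len(resp.text))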
There are probably hundreds of those across lots of projects using TC, so I'm not sure we could do so effectively.
Given that, the simpler alternative might be to do compression on demand:

(In reply to Ed Morley [:emorley] from comment #0)
> Whilst I know in some places we compress before upload to S3, Cloudfront can
> automatically use gzip too:
> https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/
> ServingCompressedFiles.html
> https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/
> ServingCompressedFiles.html#compressed-content-cloudfront-file-types
Commits pushed to master at https://github.com/taskcluster/generic-worker

https://github.com/taskcluster/generic-worker/commit/87ab14f1770706cd95eef630a351ecaaa96b828b
Bug 1347956 - blacklist certain file extensions from using gzip content encoding

https://github.com/taskcluster/generic-worker/commit/f4381839feb46103e1e3eef64172425ea5e29b2e
Merge pull request #49 from taskcluster/gzip-more-artifacts

Bug 1347956 - gzip more artifacts
I've added the blacklist in generic-worker, which should get rolled out to all worker types (OS X, Windows) in the next few days. We'll still need to do this in taskcluster-worker and docker-worker (and maybe some releng workers too - buildbot bridge / signing worker / script worker etc).

At the moment I don't check for the <95% compression ratio - I've just implemented the blacklist part, and gzip-compress everything else.
Just seen the comments between Ed and Dustin. To be honest, since we are already gzipping some artifacts and not others, I expect that anything reading artifacts is probably already handling this. Mostly I guess people will fetch logs, which already have gzip encoding.

I also think that the first time they run something and it pulls back binary content, they will quickly work out that it is compressed. So I'm hoping this change won't break too many things.
Another angle on this: Content-Encoding is not an obscure thing; it is a standard header that clients should respect. If it were something more obscure I would tend to agree, but I think it is reasonable to expect clients to respect the well-defined semantics of the Content-Encoding header.

I say, let's roll with it, and back stuff out if it causes pain, but not let the worry that something somewhere might be affected prevent us from making progress. I think a disruptive approach can have a more positive impact.
(In reply to Dustin J. Mitchell [:dustin] from comment #7)
> urllib2 being another example

Funny you should mention that! I discovered mozharness is using urllib2 and therefore doesn't support Content-Encoding when downloading artifacts - so I'm adding a fix for that in bug 1358142.
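
For reference, the kind of handling urllib2-based code needs (a sketch only, reusing the example URL from comment 0 - not the actual patch in bug 1358142):

import gzip
import io
import urllib2  # Python 2, as used by mozharness at the time

url = ("https://queue.taskcluster.net/v1/task/dTgEO-dOQ4KOnc-ZJ6FLLQ"
       "/runs/0/artifacts/public/logs/live_backing.log")
req = urllib2.Request(url, headers={"Accept-Encoding": "gzip"})
resp = urllib2.urlopen(req)
body = resp.read()
# urllib2 does not decode Content-Encoding, so gunzip the body manually.
if resp.info().get("Content-Encoding") == "gzip":
    body = gzip.GzipFile(fileobj=io.BytesIO(body)).read()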
Hi Pete!

Do you know if the changes here have been deployed? 

I'd somewhat forgotten about this bug since I thought it had been deployed, but recent slow transaction traces for Treeherder's store-failure-lines task showed 89% of the profile spent in requests fetching from e.g.:

https://queue.taskcluster.net/v1/task/VQaPi7msQIGTK9HF1PnPcw/runs/0/artifacts/public/test_info//reftest-no-accel_errorsummary.log

...which isn't served gzipped:

HTTP/1.1 200 OK
Content-Type: text/plain
Content-Length: 3479668
Connection: keep-alive
Date: Wed, 07 Jun 2017 10:10:44 GMT
Last-Modified: Wed, 07 Jun 2017 09:35:50 GMT
ETag: "0a41399ab3767f1b27799b47f50277cb"
x-amz-version-id: TiK42srJM9OeCa6AFuZ6VASC6HgkRQ5l
Accept-Ranges: bytes
Server: AmazonS3
X-Cache: Miss from cloudfront
Via: 1.1 c771900addaa417be1d0b79ff157a3f9.cloudfront.net (CloudFront)
X-Amz-Cf-Id: tOhx8tonD9be2JSFd9pjgYqtk_QnN-F8IWNSqyiVxjeAP-z8cwRmuw==
Flags: needinfo?(pmoore)
Changes for this have been deployed for generic-worker, but not docker-worker (which ran the task linked to).  I thought there might have been a discussion about whether this is behavior we wanted everywhere, but I cannot find a link to it.  Perhaps Pete recalls that discussion.
I think that we haven't discussed docker-worker yet, but my opinion is that it would make sense to implement in docker-worker, if it isn't a huge piece of work. Otherwise we should certainly make sure we have it in taskcluster-worker.

@Jonas,
how far off replacing docker-worker with taskcluster-worker are we, at a guess (6 months/1 year/...)? Do we still intend to replace it? My thinking here is, if we are close to replacing, maybe better just to roll out in taskcluster-worker.

@Ed,
Do you have an idea how much cost this might save overall, or if it would have a big performance impact? My guess is, including picking up the pieces if they break, we're talking a couple of days for making the fix(es), testing and reviewing them, and then rolling out to all our worker types, taking care of bustage etc. Maybe 3 days in total, if things don't go smoothly?
Flags: needinfo?(pmoore)
Flags: needinfo?(jopsen)
Flags: needinfo?(emorley)
Thank you for the further details.

Just to check I understand, we have:
* generic-worker -> gzip support added in this bug (comment 11)
* docker-worker -> doesn't yet use gzip, but may be replaced by taskcluster-worker (timeline unclear)
* taskcluster-worker -> appears to already gzip (added by bug 1305707), is already used for certain tasks (and might be used for more if it replaces docker-worker)

(In reply to Pete Moore [:pmoore][:pete] from comment #18)
> Do you have an idea how much cost this might save overall, or if it would
> have a big performance impact?

From Treeherder's point of view, the concern is not the cost (we don't pay for the S3 bucket or data transfer) but about reducing the proportion of the time spent fetching the json error summary files, particularly when S3 is being slow. Looking at the last 7 days, the store-failure-lines task profile shows 81% of the profile spent in the request to queue.taskcluster.net (though some of that is due to things like bug 1348071 / bug 1348072).

I'd imagine there are cost savings (storage + transfer) to be had for the Taskcluster team - for example this file as served is 236 MB but only 9.5 MB using default CLI gzip options:
https://queue.taskcluster.net/v1/task/DBYbDF8RR8CrQL9tREtDaA/runs/0/artifacts/public/test_info/reftest-no-accel_errorsummary.log
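
For reference, a rough way to reproduce that comparison (assumes requests; the exact figure depends on the gzip level used):

import gzip
import requests

url = ("https://queue.taskcluster.net/v1/task/DBYbDF8RR8CrQL9tREtDaA"
       "/runs/0/artifacts/public/test_info/reftest-no-accel_errorsummary.log")
raw = requests.get(url).content           # served uncompressed (~236 MB)
print(len(raw), len(gzip.compress(raw)))  # compare raw vs gzipped size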

> My guess is, including picking up the pieces
> if they break, we're talking a couple of days for making the fix(es),
> testing and reviewing them, and then rolling out to all our worker types,
> taking care of bustage etc. Maybe 3 days in total, if things don't go
> smoothly?

Yeah, I agree it's not going to be as straightforward as it might seem. I'm happy to wait until docker-worker is replaced if that's not years away.

My main concern is that lack of gzipping is something that seems to continually bite us. It would be great to have it considered as an essential feature the next time a new worker project is created, so we don't need to try and add it retrospectively :-)
Flags: needinfo?(emorley)
(In reply to Ed Morley [:emorley] from comment #19)
> Thank you for the further details.
> 
> Just to check I understand, we have:
> * generic-worker -> gzip support added in this bug (comment 11)

Yes, and that landed in generic-worker 8.3.0. That means the fix is currently running on our Windows builders (win2012), and our Windows 10 machines, but not yet on Windows 7 testers.

> * docker-worker -> doesn't yet use gzip, but may be replaced soon by
> taskcluster-worker but timeline unclear

I believe it uses it for log files, but not for other artifacts. garndt can confirm.

> * taskcluster-worker -> appears to already gzip (added by bug 1305707), is
> already used for certain tasks (and might be used for more if it replaces
> docker-worker)

I believe that is only for the log file. Currently taskcluster-worker isn't used for anything in production yet, as far as I know.

> 
> (In reply to Pete Moore [:pmoore][:pete] from comment #18)
> > Do you have an idea how much cost this might save overall, or if it would
> > have a big performance impact?
> 
> From Treeherder's point of view, the concern is not the cost (we don't pay
> for the S3 bucket or data transfer) but about reducing the proportion of the
> time spent fetching the json error summary files, particularly when S3 is
> being slow. Looking at the last 7 days, the store-failure-lines task profile
> shows 81% of the profile spent in the request to queue.taskcluster.net
> (though some of that is due to things like bug 1348071 / bug 1348072).
> 
> I'd imagine there are cost savings (storage + transfer) to be had for the
> Taskcluster team - for example this file as served is 236 MB but only 9.5 MB
> using default CLI gzip options:
> https://queue.taskcluster.net/v1/task/DBYbDF8RR8CrQL9tREtDaA/runs/0/
> artifacts/public/test_info/reftest-no-accel_errorsummary.log
> 
> > My guess is, including picking up the pieces
> > if they break, we're talking a couple of days for making the fix(es),
> > testing and reviewing them, and then rolling out to all our worker types,
> > taking care of bustage etc. Maybe 3 days in total, if things don't go
> > smoothly?
> 
> Yeah I agree it's not going to be as straight forwards as it might seem. I'm
> happy to wait until docker-worker is replaced if that's not years away.
> 
> My main concern is that lack of gzipping is something that seems to
> continually bite us. It would be great to have it considered as an essential
> feature the next time a new worker project is created, so we don't need to
> try and add it retrospectively :-)

Hindsight is a blessing, hey! :) But yes, indeed....

Thanks for the clarifications.

Note, I hope to roll out a newer generic-worker to the Windows 7 machines over the next week if I don't find any major problems (I can't promise), in which case all generic-worker workers will be using gzip-compressed artifacts (when the artifact is not blacklisted as already being compressed).
tc-worker won't be replacing docker-worker anytime soon; we're likely focusing on other uses first.

Note, we should probably declare somewhere that some artifacts are gzipped and others aren't, and that consumers of the API are required to support Content-Encoding: gzip.
Flags: needinfo?(jopsen)
(In reply to Jonas Finnemann Jensen (:jonasfj) from comment #21)
> Note, we should probably declare somewhere that some artifacts are gzipped
> others aren't, and that consumers of the API is required to support
> Content-Encoding: gzip.

That's in the new manual.
Found in triage.

(In reply to Dustin J. Mitchell [:dustin] pronoun: he from comment #22)
> (In reply to Jonas Finnemann Jensen (:jonasfj) from comment #21)
> > Note, we should probably declare somewhere that some artifacts are gzipped
> > others aren't, and that consumers of the API is required to support
> > Content-Encoding: gzip.
> 
> That's in the new manual.

Is documentation enough to close this out in this case?
Flags: needinfo?(dustin)
Sure?
Status: NEW → RESOLVED
Closed: 6 years ago
Flags: needinfo?(dustin)
Resolution: --- → FIXED
Component: Platform and Services → Services