Closed Bug 1291940 Opened 8 years ago Closed 6 years ago

TaskCluster slows down considerably when using AUFS

Categories

(Taskcluster :: Workers, defect, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: gps, Assigned: wcosta)

References

(Blocks 2 open bugs)

Details

In bug 1290282, I was investigating some oddities where c4.8xlarge instances barely had any performance wins over c3.4xlarge instances on TC Firefox build tasks despite having 2x the number of CPU cores.

I knew something was wrong after garndt posted dstat output while `mach build` was doing some heavy compilation:

----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw
 10   5  83   3   0   0| 705k   16M|   0     0 |   0     0 |  18k   43k
 39  15  42   4   0   0|   0    25M|  42k   44k|   0     0 |  52k  106k
 36   8  29  28   0   0|4096B   21M|  35k   34k|   0     0 |  38k   68k
 47   9  43   1   0   0|   0  4728k|  54k   56k|   0     0 |  48k   96k
 45  14  41   0   0   0|   0    28k|  25k   25k|   0     0 |  46k   93k
 42  14  44   0   0   0|   0    20k|  74k   74k|   0     0 |  53k  104k
 34  20  46   0   0   0|   0    36k|  77k   76k|   0     0 |  57k  111k
 35  19  45   0   0   0|   0  1376k|  53k   52k|   0     0 |  53k  104k
 37  20  43   0   0   0|   0  9456k|  79k   82k|   0     0 |  54k  104k
 28  27  46   0   0   0|   0    36k|  82k   81k|   0     0 |  54k  105k
 31  24  45   0   0   0|   0  8192B|  41k   41k|   0     0 |  51k   99k
 40  18  42   0   0   0|   0    28k|  44k   43k|   0     0 |  50k   97k
 36  21  42   0   0   0|   0  1444k| 135k  139k|   0     0 |  45k   88k
 33  21  46   1   0   0|   0    21M| 105k  100k|   0     0 |  47k   92k
 27  26  47   0   0   0|   0  2020k| 129k  132k|   0     0 |  52k  103k
 38  19  43   0   0   0|   0    32k|  90k   92k|   0     0 |  50k   96k

Note the very high sys (kernel) CPU usage and number of context switches. For comparison, here's what I got on a non-TC c4.8xlarge instance doing the same operation:

----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw
 93   6   1   0   0   0|   0    41M| 660B  832B|   0     0 |9801  4182
 92   7   1   0   0   0|   0  7320k|2016B   13k|   0     0 |9596  7047
 93   7   1   0   0   0|   0  8888k|2310B   10k|   0     0 |9567  4825
 91   8   1   0   0   0|   0  9248k|2574B   11k|   0     0 |9798  6004
 93   7   1   0   0   0|   0  7464k| 924B 3044B|   0     0 |9530  4706
 94   6   0   0   0   0|   0   102M| 198B  730B|   0     0 |  10k 4105
 94   5   1   0   0   0|   0   105M|1386B 8132B|   0     0 |  12k 7619
 92   7   1   0   0   0|   0    16M|1716B 6180B|   0     0 |9710  8397
 93   6   0   0   0   0|   0  4096k| 396B 1220B|   0     0 |9570  5332

The sys numbers are still a bit high for my liking. But context switches are *way* down, dare I say reasonable.
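
For reference, both dumps are just dstat's default column set; an invocation along these lines reproduces it (the exact flags used are my assumption):

  dstat 1           # default columns: cpu, disk, net, paging, system; 1s interval
  dstat -cdngy 1    # the same column groups spelled out explicitly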

The TC team stood up a docker worker and gave me SSH access to investigate.

I initially thought OS and kernel differences might be to blame. My instance was running Ubuntu 16.04 (Linux 4.4). TC was running Ubuntu 14.04 (Linux 3.13) and had a number of outdated system packages. Upgrading system packages (including the kernel) and rebooting appeared to make things a bit faster, although I would have to measure again to confirm.

I built Firefox from source outside of Docker and both my instance and TC instance had remarkably similar times:

  5:04 wall vs   5:03 wall
143:28 user vs 143:08 user
  9:18 sys  vs   9:29 sys

Inside Docker was a different story.

Building from a host-mounted ext4 volume was slightly slower:

  5:42 wall
136:49 user
 13:54 sys

But building from a Docker aufs volume was much slower:

  8:38 wall
132:19 user
 75:03 sys
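
For the record, the three configurations were driven roughly like this; the image name and paths are placeholders, not the exact commands I ran:

  # 1) on the host, outside Docker (ext4)
  cd /mnt/ext4/gecko && time ./mach build

  # 2) inside Docker, srcdir/objdir bind mounted from the host ext4 volume
  docker run --rm -v /mnt/ext4/gecko:/builds/worker/gecko <builder-image> \
      bash -c 'cd /builds/worker/gecko && time ./mach build'

  # 3) inside Docker, srcdir/objdir living on the container's aufs filesystem
  docker run --rm <builder-image> \
      bash -c 'cd /builds/worker/gecko && time ./mach build'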

dstat output during parts of the build was horrendous:

----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw
  9  82   9   0   0   0|   0    72k|3300B   15k|   0     0 |  20k   27k
  7  83  11   0   0   0|   0    64k|3366B   12k|   0     0 |  19k   29k
  3  88   9   0   0   0|   0    68k|1188B 5074B|   0     0 |  20k   29k
  5  86   9   0   0   0|   0    64k|  66B  358B|   0     0 |  25k   25k
 12  78  10   0   0   0|   0   576k|2574B   10k|   0     0 |  18k   27k
  6  80  14   0   0   0|   0    30M| 990B 3684B|   0     0 |  19k   28k
  4  91   4   0   0   0|   0    64k| 198B  722B|   0     0 |  21k   22k
  5  93   1   0   0   0|   0    64k| 132B  874B|   0     0 |  22k   21k
  9  85   6   0   0   0|   0    92k|2310B 8290B|   0     0 |  19k   26k
  4  87   9   0   0   0|   0  1316k| 198B  866B|   0     0 |  18k   25k
  3  95   3   0   0   0|   0    18M| 198B  714B|   0     0 |  21k   21k
  7  90   3   0   0   0|   0    64k| 990B 4310B|   0     0 |  17k   22k

95% CPU time in the kernel!!!
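
If anyone wants to see exactly where that kernel time goes, a system-wide perf sample taken during the build should point straight at the aufs code paths (assuming perf is installed on the instance; this isn't something I ran for the numbers above):

  # sample all CPUs with call graphs for 30s while the build is running
  sudo perf record -a -g -- sleep 30
  sudo perf report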

So it appears aufs under high load is a steaming pile. This shouldn't come as a surprise - aufs is an abandoned project and overlayfs is the equivalent technology that's part of the Linux kernel.

So where does this leave us? Well, TaskCluster is using aufs heavily in production. My understanding is that unless a cache is declared (caches use volumes mounted from the host and are backed by ext4), your I/O inside a TC task will be using aufs. And if aufs has performance problems, well, TC tasks will be slow.

The good news is a *lot* of Firefox automation uses caches (read: no aufs), so the negative impact of aufs is mitigated a bit. However, caches are stripped on most Try tasks because of concerns over cache poisoning. So e.g. a TC Firefox build will use ext4 on mozilla-central but aufs on Try. That means Try tasks will be slower than non-Try tasks. In theory, this applies to non-build tasks as well.

A potential silver lining is that the aufs performance degradation might only be for large CPU counts driving high concurrent I/O loads. In bug 1290282, c3.4xlarge instances were faster than c3.2xlarge instances (not linear as expected but still pretty good) whereas c3.8xlarge hardly had any advantage over c3.4xlarge despite double the cores. I'll have to run some tests on less beefy instances to see if aufs overhead is measurable on e.g. 1, 2, and 4 core instances, which make up most of the instances in TC. If it is, then this performance problem could result in thousands of hours of wasted machine time every day.

Per discussion in #taskcluster, jonasfj and garndt have ideas for addressing this at the worker level. I'll let them chime in.
We currently build on c3.2xlarge instances on an Ubuntu 14.04 host. I decided to compare the performance of a build on a c4.2xlarge (8 vCPUs) instance at the host level and inside a Docker aufs volume to quantify the impact aufs has on TC tasks today.

Host build (ext3):

 19:30 wall
138:54 user
  7:23 sys

Docker build (aufs):

 20:15 wall
138:50 user
  9:33 sys

There appears to be slight overhead to aufs on a c4.2xlarge (8 vCPUs), although this was a sample size of 1, so the results could be within variance. FWIW, I did witness rather high sys CPU time during the "export" tier and during parts of the "compile" tier, but it wasn't as high as on the c4.8xlarge. (I suspect the "export" tier is bad because it creates thousands of files and symlinks and is effectively a stress test for write I/O.)

On a c4.4xlarge (16 vCPUs):

Host build (ext3):

 10:59 wall
141:54 user
  7:28 sys

Docker build (aufs):

 11:47 wall
140:32 user
 15:00 sys

sys time is higher with aufs. dstat output also shows sys CPU time was higher throughout this build compared to the c4.2xlarge. Still not as bad as c4.8xlarge. But noticeably worse than c4.2xlarge.

The good news is that the aufs impact on c4.2xlarge instances appears minimal, and on c4.4xlarge the ~45s wall-time gap between native and aufs remained intact (although I'm not sure how that happened given ~5 minutes more kernel time). So the impact isn't horrendous like it is with c4.8xlarge instances. Since we don't run many instances with more than 8 cores in TC (do we run any at all?), this hopefully means we don't have major, widespread performance issues on TC due to aufs.

The bad news is that until we address this underlying problem with aufs, running high vCPU instances in TC will have diminishing returns for many workloads. We'll get *some* wins. But they won't be as large as expected because aufs will eat the extra CPU. Put another way, running high vCPU instances is, to some extent, throwing money away. c4.4xlarge has an arguably tolerable amount of aufs overhead. But we should definitely avoid c4.8xlarge instances for the time being, as aufs could eat a lot of their extra CPU.
In #taskcluster yesterday, I stated that the excessive kernel time appeared to have gone away in Ubuntu 16.04. I was wrong. Kernel overhead on a c4.8xlarge on Ubuntu 16.04 is *worse* than on 14.04!

Ubuntu 14.04 aufs (from comment #0):

  8:38 wall
132:19 user
 75:03 sys

Ubuntu 16.04 aufs:

 20:53 wall
103:03 user
461:28 sys  !!!

Ubuntu 16.04 overlayfs:

  5:49 wall
135:32 user
  9:39 sys

Building outside of Docker yielded the same time as within Docker on overlayfs. That's an encouraging sign for overlayfs.
Did you use the overlay or overlay2 docker storage driver?
overlay with docker 1.10 (the Docker that ships with Ubuntu 16.04).
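
(For anyone reproducing: `docker info` reports the active driver, and on this Docker the driver is selected via the daemon's --storage-driver flag. The /etc/default/docker location below is my assumption about the stock Ubuntu packaging on that host.)

  # check which storage driver the daemon is using
  docker info | grep -i 'storage driver'

  # Ubuntu's docker 1.10 packaging picks the driver up from daemon flags, e.g.:
  # /etc/default/docker
  DOCKER_OPTS="--storage-driver=overlay"
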
There has been some talk of mitigating the impact of aufs by aggressively using Docker volumes (which are essentially bind mounted from the host filesystem). This is a good idea in theory. So let's put that to the test.

On the same c4.8xlarge Ubuntu 16.04 instance used in comment #2, building Firefox with a source directory and object directory volume bind mounted from the host but still using aufs for the Docker container:

Ubuntu 16.04 host volume + aufs container:

 10:09 wall
111:27 user
117:41 sys

As you can see, we're somewhere between native/overlayfs and full aufs despite avoiding aufs for the Firefox source and output.

What gives? My theory is that the *reads* from the aufs container filesystem by processes like python and gcc loading standard library modules and headers are triggering aufs overhead. So even though we do hardly any write I/O to aufs, aufs is still slowing us down by interfering with the read path. FWIW, dstat reported little to no disk reads during the build. This is presumably because everything was in the page cache. But apparently the page cache isn't good enough for aufs: it still interferes with reading.
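
A quick way to see why the read path still matters: inside the running container, / (where python, gcc, and the system headers live) is still the aufs mount; only the bind-mounted workspace is ext4. Roughly (the workspace path is illustrative):

  # run inside the container
  findmnt -no FSTYPE /                           # aufs  <- stdlib/headers are read from here
  findmnt -no FSTYPE /builds/worker/workspace    # ext4  <- only the srcdir/objdir avoids aufs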

It's looking more and more like avoiding aufs entirely is the only path forward.
I built with overlay2 and it has nearly identical performance to overlay.
Depends on: 1293395
Blocks: thunder-try
We have what looks like to be an intermittent failure related to aufs in bug 1295474.

garndt: given the performance and stability implications of running aufs, what is the timeline for not using aufs?
Blocks: 1295474
Flags: needinfo?(garndt)
Also note that bug 1256004 was also likely due to aufs, although we were able to work around it.
I don't have a timeline really, but I can say I can start looking into this soon with others. I think making the changes is easy enough; it's then validating that everything continues running as expected that might take a little time.
Flags: needinfo?(garndt)
Assignee: nobody → garndt
FWIW, I tried switching my local dev environment to overlay2 and I'm running into what looks like bugs due to the filesystem not behaving like a filesystem should. So it looks like that storage driver is still buggy :/

I've used devicemapper with some success. However, it is *really* slow at creating new image layers compared to aufs and overlayfs, which is a big deal for my local development environment, which rebuilds tons of Docker images frequently. However, I don't think this is a major issue for TC because it treats Docker images as tarballs and doesn't do any of the fancy layering foo.

The matrix at https://docs.docker.com/engine/userguide/storagedriver/selectadriver/ also seems to indicate devicemapper might be the least crappy storage driver.
That's really good to know. Is it worth it to try overlay2 on the TC instances or is it just way too far gone to be of use?
Everything coming out of Docker seems to indicate overlayfs/overlay2 is the future. But if there are bugs in the filesystem, I'm inclined to write it off.

I should clarify my comment about layer creation being slow. It's like 1-3s instead of say 100ms. This likely won't matter unless you are building or pulling images with many layers.
Another option is btrfs.
I upgraded to Docker 1.12 and haven't yet reproduced the overlayfs bug I was seeing before. It's possible I wasn't actually running overlay2 on the machine encountering the issues referenced in comment #10.

Regarding btrfs, I'll have to give it another try. I think its layer generation is also a bit slow. But, again, we shouldn't care about this so much.
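
For anyone else switching a local daemon: with Docker 1.12 this is just a daemon config change plus a restart; note that images stored under the old driver have to be re-pulled. A minimal sketch (standard config path):

  # /etc/docker/daemon.json
  { "storage-driver": "overlay2" }

  sudo systemctl restart docker
  docker info | grep -i 'storage driver'    # should now report overlay2
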
Ahh - the issue I was having with overlay was related to hard links (some processes were confused by the presence of hard links where hard links should not have been present). The docs at https://docs.docker.com/engine/userguide/storagedriver/overlayfs-driver/ indicate that "overlay" relied heavily on hard links and "overlay2" does not.

FWIW, the bottom of that page contains some nuggets about overlayfs, notably that open(2) doesn't conform to POSIX standards and rename(2) isn't supported either. Good times. A "real" filesystem like btrfs keeps sounding better and better.
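
Those quirks are easy to poke at from a throwaway container. For example, rename(2) of a directory that only exists in the lower (image) layer is expected to fail with EXDEV on the overlay drivers (mv hides this by falling back to copying). A rough probe, with the image and path chosen arbitrarily:

  docker run --rm python:2.7 \
      python -c "import os; os.rename('/usr/share/doc', '/tmp/doc')"
  # expected on overlay/overlay2: OSError: [Errno 18] Invalid cross-device link
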
UNIX domain sockets don't work with the overlay filesystem: https://github.com/docker/docker/issues/12080
garndt: I keep seeing people bit by AUFS bugs or performance issues. The problems aren't obvious because AUFS "just works" most of the time. I suspect the use of AUFS is a "silent killer" of overall performance on TaskCluster, especially on large-core instances (read: we're wasting money due to sub-optimal performance).

What's the story on transitioning off AUFS? Depending on the timeline, it may justify someone doing work to have tasks alert when they are doing obviously sub-optimal things on AUFS like not having the workspace be a cache. But that won't save us: frequently accessed files baked into the Docker image will still kill perf because they are hosted on AUFS. And Try tasks can't use caches easily, meaning they always get AUFS and are therefore slower.
Flags: needinfo?(garndt)
Sorry for the delayed response, I just returned from PTO and will be looking at this once I get caught up.  I will have to see where we are with other priority projects on our plate.  Last time I looked at this, I think I hit a rabbit hole of issues when trying to switch the fs drivers, but we can investigate with a new set of eyes.

I'm going to leave the ni? at least for a couple of days to keep it on my radar.
Switching ni? to Wander; he will start looking into this.
Flags: needinfo?(garndt) → needinfo?(wcosta)
Component: Worker → Docker-Worker
Blocks: 1399895
Assignee: garndt → wcosta
Status: NEW → ASSIGNED
Flags: needinfo?(wcosta)
While I was hacking on bug 1415725, I cranked out a patch to switch docker-worker to overlay2. PR at https://github.com/taskcluster/docker-worker/pull/337.
Commit pushed to master at https://github.com/taskcluster/docker-worker

https://github.com/taskcluster/docker-worker/commit/feb14875f6dbee6cdbca619a280f4b56deeb398b
Upgrade to kernel 4.4 and use overlay2 storage driver

AUFS is the source of numerous performance problems and
filesystem consistency issues. Bug 1291940 has plenty of
details. The tl;dr is AUFS takes out a global kernel lock
for pretty much all I/O operations and concurrent I/O
grinds to a halt as you add more concurrency. Firefox builds
actually get slower as you use more cores. In addition, there
are consistency issues with AUFS that result in files randomly
not being available after write.

In this commit, I bump the kernel version from 3.13 to 4.4
(so overlayfs is usable), add support for the overlay2 storage
driver (added in Docker 1.12 - which we conveniently use),
and switch the default storage driver to overlay2.

The version of v4l2loopback was also bumped to the latest
in order to pull in fixes so it compiles with Linux 4.4.
wcosta just pushed this out ~2 hours ago. So all new worker instances should be using the new AMIs with Linux 4.4 and overlay2 storage driver.

I'm optimistic we'll see perfherder alerts as a result of this change. Could take a while for existing workers running AUFS to self-terminate, however.
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
https://treeherder.mozilla.org/perf.html#/graphs?series=mozilla-inbound,1445527,1,2&series=mozilla-inbound,1582370,1,2&series=mozilla-inbound,1446792,1,2 is showing a ~4 minute Linux64 opt build time drop for c4.4xlarge today. This will almost certainly generate a Perfherder alert once enough data comes in.
I rolled back the AMIs today after a report of the docker daemon failing to start on some instances. Connecting to a machine, docker.log showed a failure to create logical volumes:

ls: cannot access /dev/nvme*n*: No such file or directory
umount: /dev/xvdb: not mounted
  No physical volume label read from /dev/xvdb
  Can't open /dev/xvdb exclusively.  Mounted filesystem?
  Unable to add physical volume '/dev/xvdb' to volume group 'instance_storage'.
  No physical volume label read from /dev/xvdb
  Can't open /dev/xvdb exclusively.  Mounted filesystem?
  Unable to add physical volume '/dev/xvdb' to volume group 'instance_storage'.
  No volume groups found
Creating logical volume 'instance_storage'
ls: cannot access /dev/nvme*n*: No such file or directory
umount: /dev/xvdb: not mounted

Tracking down the issue, I found the root cause: vgcreate tries to open /dev/xvdb exclusively and fails:

open /dev/xvdb exclusively
open("/dev/xvdb", O_RDWR|O_EXCL|O_DIRECT|O_NOATIME) = -1 EBUSY (Device or resource busy)
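
For the next person chasing this: the useful question is what already has /dev/xvdb open or mounted by the time vgcreate runs. Something along these lines from a broken instance would narrow it down (pure diagnosis sketch, not what the AMI actually runs):

  lsblk -f /dev/xvdb        # already formatted / already part of something?
  grep xvdb /proc/mounts    # already mounted?
  sudo fuser -v /dev/xvdb   # which processes hold it open?
  sudo dmsetup table        # already claimed by device-mapper?
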
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Today I rolled out the new images after some tests yesterday. gecko-t-linux-xlarge is the only worker type that has shown the problem so far.
Per IRC discussion, it looks like failures in gecko-t-linux-xlarge were due to using an older, buggy AMI. The new AMIs appear to "just work."

wcosta: can we resolve this bug now?
Flags: needinfo?(wcosta)
(In reply to Gregory Szorc [:gps] from comment #27)
> Per IRC discussion, it looks like failures in gecko-t-linux-xlarge were due
> to using an older, buggy AMI. The new AMIs appear to "just work."
> 
> wcosta: can we resolve this bug now?

I found one case with the newest AMI where the error happened. I connected to the machine, and after I rebooted it, the error was gone. It feels like there is a very subtle race condition in docker daemon initialization that produces the wrong behavior once in a while. Not sure if we should close this bug and investigate in a follow-up one, as it seems not to affect production. The only downside is the Papertrail backlog.
Flags: needinfo?(wcosta)
If the incidence rate is low, let's get a new bug on file.

FWIW, I think the way we create the volume (as part of docker's initialization routine) is kinda wonky. IMO this should be a one-time operation performed as part of launching the EC2 instance. But we can discuss this in another bug.
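
FWIW, the one-time setup I have in mind could be as small as something like this, run from the instance's launch path (e.g. cloud-init) before docker starts. This is only a sketch; the device name varies by instance type (see the nvme/xvdb dance in the log above), and the LV name and mount point are placeholders:

  pvcreate /dev/xvdb
  vgcreate instance_storage /dev/xvdb
  lvcreate -l 100%VG -n docker instance_storage
  mkfs.ext4 /dev/instance_storage/docker
  mount /dev/instance_storage/docker /var/lib/docker
  systemctl start docker
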
Status: REOPENED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
The problem seems to have become a lot more frequent since I added c5 instances.
Rolled back the AMIs, as we won't take advantage of c5 instances and there is still occasional bustage with other instances. I didn't revert the patches, however.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
New AMIs got rolled out today. Workers are still recycling. But it looks like things will stick this time.
Status: REOPENED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Blocks: 1424461
Blocks: 1425493
Component: Docker-Worker → Workers