Closed Bug 1443590 Opened 6 years ago Closed 6 years ago

Use clang-cl for Windows builds we ship to users

Categories

(Firefox Build System :: General, enhancement)

enhancement
Not set
normal

Tracking

(relnote-firefox 63+, firefox63 fixed)

RESOLVED FIXED
mozilla63
Tracking Status
relnote-firefox --- 63+
firefox63 --- fixed

People

(Reporter: ted, Assigned: away)

References

(Depends on 1 open bug)

Details

Attachments

(2 files)

We've discussed this on several occasions but apparently nobody ever filed a bug.

There are a number of reasons using clang-cl for our official builds would be good:
* The ability to patch the toolchain as we do on other platforms
* The ability to diagnose and upstream fixes for compiler bugs we encounter
* It would provide motivation to switch to clang on all platforms, which would make it so we only have to worry about a single compiler for things like performance optimization (I don't assume we would drop support for MSVC or gcc, we just wouldn't have to worry as much about compiler quirks)

bug 752004 covers issues with using clang-cl to build Firefox, but we have a static analysis build stood up currently so I don't know how much more work it would be to get things to a shippable state.

We would need to do performance testing. I know dmajor has looked into this in the past.

Chrome switched their Windows builds over and they recently hit the release channel:
http://blog.llvm.org/2018/03/clang-is-now-used-to-build-chrome-for.html?m=1
An additional advantage is that this will enable C++/Rust cross language inlining using LTO.
I assume we'd also need to be using lld for that to work, right? I don't know if we'd want to block on that initially (it doesn't sound like Chrome did?) but I have no particular opinion.
We may have no choice but to use LTO, to get performance on par with MSVC PGO builds. Our experience has been very different from Chrome's in this regard.
Depends on: WinLTO
Hi, I worked on clang-cl, and on switching Chromium to it. We think it'd be super cool if Firefox was able to use clang-cl.

You might be able to use the technique described in https://bugs.chromium.org/p/chromium/issues/detail?id=728324#c7 to figure out where LTO would do the most for you, and do that manually for a few hotspots.

Do you have any data on where you see big perf differences? In specific areas, or all over the place? I'm surprised you're seeing a big difference here. Do you use LTCG/PGO with MSVC? (We did do that though and didn't need it in clang builds.) What are your /O flags? Do you build with exceptions enabled? RTTI? Do you export lots of symbols, or few? Do you build many dlls, or just one or two binaries?

I'd guess that there's some smallish amount of work you could do to be able to switch just the compiler, without regressing perf much.
First, thank you for working on clang-cl! Countless projects will benefit from having a robust and viable open source toolchain on Windows. A rising tide does indeed lift all ships.

I can answer some of your specific questions...

An example Firefox Nightly 64-bit build can be seen at https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&revision=bccdc684210431c233622650a91454c09f6af9eb&selectedJob=166318973. https://taskcluster-artifacts.net/dhrUqAVcS0aWfRm-E3EUDw/0/public/logs/live_backing.log is a log of the build. The target.zip file contains the build output.

We have two main binaries: firefox.exe and xul.dll. firefox.exe is a relatively small binary and most of the code is in xul.dll. (firefox.exe is ~256kb and xul is ~33mb.) There are some other dlls for things like media/graphics support.

Example cl flags for the initial PGO build phase are: -MD -utf-8 -TP -nologo -Zc:sizedDealloc- -D_HAS_EXCEPTIONS=0 -W3 -Gy -Zc:inline -Gw -GR- -Zi -O1 -Oi -Oy- -GL   -Fdgenerated.pdb -FS

Our final link.exe invocation is something like: z:/build/build/src/vs2017_15.4.2/VC/bin/Hostx64/x64/link.exe -NOLOGO -DLL -OUT:xul.dll -PDB:xul.pdb -SUBSYSTEM:WINDOWS,6.01 -MACHINE:X64 StaticXULComponentsStart.obj nsDllMain.obj ./module.res -LARGEADDRESSAWARE -NXCOMPAT -RELEASE -DYNAMICBASE -DEBUG -DEBUGTYPE:CV -DEBUG -OPT:REF,ICF -DELAYLOAD:comdlg32.dll -DELAYLOAD:netapi32.dll -DELAYLOAD:secur32.dll -DELAYLOAD:wininet.dll -DELAYLOAD:winspool.drv -DELAYLOAD:oleacc.dll -DELAYLOAD:msdmo.dll -DELAYLOAD:api-ms-win-core-winrt-l1-1-0.dll -DELAYLOAD:api-ms-win-core-winrt-string-l1-1-0.dll -LTCG:PGUPDATE -CGTHREADS:8

And some of its output:

23:52:28     INFO -     Creating library xul.lib and object xul.exp
23:52:28     INFO -  Generating code
23:52:28     INFO -  0 of 0 ( 0.0%) original invalid call sites were matched.
23:52:28     INFO -  0 new call sites were added.
23:52:28     INFO -  360 of 763359 (  0.05%) profiled functions will be compiled for speed, and the rest of the functions will be compiled for size
23:52:28     INFO -  3296496 of 8112752 inline instances were from dead/cold paths
23:52:28     INFO -  763258 of 763359 functions (100.0%) were optimized using profile data, and the rest of the functions were optimized without using profile data
23:52:28     INFO -  35039334969 of 35039334969 instructions (100.0%) were optimized using profile data, and the rest of the instructions were optimized without using profile data
23:52:28     INFO -  Finished generating code

What's really impressive is those 0.05% of PGO optimized functions make a big difference. I don't have a link on hand to our tool to compare performance results. But I'm sure another engineer will provide one...
In Chromium, we build the UI with -O1, some of the perf-critical parts (skia, base, etc -- https://cs.chromium.org/search/?q=optimize_max+file:%5C.gn&sq=package:chromium&type=cs) with -O2. Maybe your using -O1 globally explains some of the effect.


Is there a way to get link.exe to tell you which 0.05% of functions were optimized for speed? https://blogs.msdn.microsoft.com/vcblog/2018/01/04/visual-studio-2017-throughput-improvements-and-advice/ mentions the linker flag /d2:-cgsummary , maybe it prints something useful? (Currently not near a Windows box to check myself.)
https://treeherder.mozilla.org/#/jobs?repo=try&revision=7a2029584482a81edf2be034622d6c341031a271 will test /d2:-cgsummary in PGO link flags. That will take ~90 minutes for the build to complete.
We use -O2 within the JS engine: https://searchfox.org/mozilla-central/rev/efce4ceddfbe5fe4e2d74f1aed93894bada9b273/js/src/old-configure.in#779 and -O1 -Oi everywhere else.

Without LTO, our plain clang-cl + link.exe builds lag far behind MSVC PGO builds. I can dig up numbers later this week if there is interest.

The best contender I have been able to produce with clang-cl is: -O2 everywhere, ThinLTO with LLDLTO=3, and an order file produced with the Chromium instrumentation. Here is the perf comparison of that build against MSVC PGO: https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=6e33c533c935&newProject=try&newRevision=a18de4d58e27&framework=1&showOnlyConfident=1

32-bit comes out a little bit ahead, but 64-bit still lags by generally 3-9%. It troubles me that we had to add LTO even to get here. Given Chrome's experience, I was hoping to get perf parity with merely clang-cl and link.exe, and that LTO would just be icing on the cake. It makes me wonder if we're doing something wrong in our build config.

Also, the builds suffer some other problems that I haven't had time to investigate: I had to link with -FORCE to get around some unresolved symbols under ThinLTO, and the binary sizes are much larger than MSVC's. I tried building with -Os and the results were horribly slow.

To further close the gap I tried getting clang-cl to do PGO in bug 1341525, but I'm hitting mysterious linking failures: http://lists.llvm.org/pipermail/llvm-dev/2018-February/121312.html
That treeherder link does look impressively bad. Did you look at some of the tests (say, tpaint) in a profiler with for both builds and compared where the slowdown is? The good news about regressions so large is that they're often at least easy to see in profiles.

(We're seeing large binary size with ThinLTO too; one of the reasons we don't ship that yet.)
Attached file cgsummary for 32-bit
Here is the xul.dll link.exe excerpt with /d2:-cgsummary.
I tried profiling one test last month, I don't remember which. There wasn't really a single obvious slowdown and I didn't see anything immediately actionable other than "get PGO working" (e.g. I saw that MSVC made JS float paths be cold, presumably based on profile data).

In any case it would probably be worth doing more of, in case other tests do point to more interesting differences.
Depends on: 1443712
Re comment 10: Hm, looks like the cgsummary output only prints functions that it spends a long time codegen'ing, not the ones that it optimizes for speed, if I'm reading the output right. Oh well.
Depends on: 1443827
can you get an updated set of try pushes- I wanted to look at more details of the logs before/after this change to see if anything stood out from a perf point of view.
Depends on: 1444171
> My ThinLTO push failed to build:
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=0b331851aafb

I tracked this down, filed LLVM 36686, and hacked around it to get unblocked: https://hg.mozilla.org/try/file/fd6e58d13893/build/build-clang/no-COFF-dso_local.patch

With ThinLTO at LLDLTO=3, compilation at O2 (not sure if that matters with LLDLTO?), and an order file, 32-bit has a lot more wins than regressions: https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=fbb4a1532a2c&newProject=try&newRevision=fd6e58d13893&framework=1&filter=32&showOnlyConfident=1

64-bit for the most part still lags by single-digit percentages: https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=fbb4a1532a2c&newProject=try&newRevision=fd6e58d13893&framework=1&filter=64&showOnlyConfident=1

Note that we measure startup I/O (which for these purposes means code size) only on 32-bit builds. That's why 64-bit doesn't list a regression, though I'm sure there is one.
Thanks for reaching out to help with our efforts, Nico!

(In reply to thakis from comment #9)
> (We're seeing large binary size with ThinLTO too; one of the reasons we
> don't ship that yet.)

Is this something that your team is currently working on to improve?  It would be nice for us to know how much we should count on being able to use ThinLTO in the near future...
(In reply to :Ehsan Akhgari from comment #16)
> Is this something that your team is currently working on to improve?  It
> would be nice for us to know how much we should count on being able to use
> ThinLTO in the near future...

LLVM r327563 enables function sections and data sections when using LTO, which should help with this.
(In reply to inglorion from comment #17)
> LLVM r327563 enables function sections and data sections when using LTO, which should help with this.

Thanks for the pointer, it does seem to have helped a bit. Compared to my previous ThinLTO push:

installer size opt
windows2012-32   72,214,618.00 	> 	70,771,796.00 	-2.00%
windows2012-64   75,703,302.00 	> 	73,595,802.00 	-2.78%

tp5n main_startup_fileio opt e10s stylo
windows7-32   	112,776,648.20 ± 0.00% 	> 	108,128,306.20 ± 0.00% 	-4.12%
(In reply to :Ehsan Akhgari from comment #16)
> Thanks for reaching out to help with our efforts, Nico!
> 
> (In reply to thakis from comment #9)
> > (We're seeing large binary size with ThinLTO too; one of the reasons we
> > don't ship that yet.)
> 
> Is this something that your team is currently working on to improve?  It
> would be nice for us to know how much we should count on being able to use
> ThinLTO in the near future...

We're looking at ThinLTO some (see e.g. the change inglorion linked to), but I'd say it's currently not the main priority. Once we've switched to LLD it'll become more important. We believe that it's important to roll out these things incrementally and not switch everything all at once. I'd guess that with some profiling you might be able to switch just the compiler at first too, but I might be wrong.
It looks like Skia has started dropping support for building some features that we use with MSVC: https://skia.googlesource.com/skia/+/22e536e3a1a09405d1c0e6f071717a726d86e8d4
(In reply to Jeff Muizelaar [:jrmuizel] from comment #20)
> It looks like Skia has started dropping support for building some features
> that we use with MSVC:
> https://skia.googlesource.com/skia/+/22e536e3a1a09405d1c0e6f071717a726d86e8d4

Note that the consequences of the upstream Skia change are that a few files within Skia need to be compiled with Clang, but that doesn't require *all* of Skia to be compiled with Clang. So long as we could just link the resulting object files into the build.
Depends on: 1447727
Depends on: 1448976
Depends on: 1448978
No longer depends on: 1448976
Depends on: 1448306
Depends on: 1463190
Depends on: 1465546
Depends on: 1465633
Attached patch clang-clSplinter Review
With the landing of bug 1463190 and bug 1444171 I believe we have enough pieces in place to start getting real-world data on clang-cl builds. It's possible that this patch will bounce a few times but we have to start somewhere.

Putting these changes in `mozconfig.vs-latest` looks pretty silly but makes for easy reverts and re-lands if needed. We can fix up the filenames at a more stable point down the road.
Assignee: nobody → dmajor
Attachment #8991025 - Flags: review?(gps)
Do I need to touch CLOBBER with these changes?
I don't think so--all of our builds in CI are clobber builds currently and this doesn't impact local developer builds.
Comment on attachment 8991025 [details] [diff] [review]
clang-cl

Review of attachment 8991025 [details] [diff] [review]:
-----------------------------------------------------------------

This is mostly a rubber stamp / sign-off review, since the patch itself is trivial.

Obviously this change has broad implications. And we expect various things to change/break.

My biggest concern is that the appropriate people are aware of this change so any fallout can be more easily attributed to it. Per #build earlier today:

17:31 <@gps> dmajor: did you or anyone engage relman about bug 1443590? are you planning any comms to them when this lands?
17:32 < firebot> https://bugzil.la/1443590 — NEW, dmajor@mozilla.com — Use clang-cl for Windows builds we ship to users
17:34 < dmajor> gps: My understanding is that ajones has been talking to them. I guess you could flag him for confirmation
17:53 < ajones> gps: relman folks have told me to enable clang-cl on nightly
17:54 < ajones> to be specific, lizzard and sylvestre
17:56 <@gps> ajones: thanks for the confirmation.

Since it appears relman has blessed landing this, I think our ducks are in a row. Let's do it.

But please also follow-up with an email to appropriate mailing lists so everyone is aware when this lands and when the first Nightlies will be going out.

Thank you and everyone else who helped for your hard, persistent work on this project. Using an open source toolchain on Windows and doing Windows cross-builds has been a dream for years. This patch is a major milestone towards that long-term ideal.
Attachment #8991025 - Flags: review?(gps) → review+
Pushed by dmajor@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/01380eb6a6d0
Use clang-cl for official Windows builds. r=gps
Can we make sure this won't ride the train to beta until it is all green.
We need a severe QA sign off and outreach to AV editors before moving it to beta.
(In reply to Sylvestre Ledru [:sylvestre] from comment #27)
> Can we make sure this won't ride the train to beta until it is all green.
> We need a severe QA sign off and outreach to AV editors before moving it to
> beta.

We certainly could.

But I'd be scared to put this behind a "Nightly only" config knob because if we do, then we'll have little continuity between the performance/behavior of builds on Nightly when they go to Beta. i.e. we'll be using Clang on Nightly and on first Beta we'll switch to MSVC. That seems risky to me since we'd have ~weeks of changes without significant testing on MSVC. I think it would be safer if we established a flag day - say the beginning of soft freeze - where we make a decision to let Clang ride the trains or back it out on Nightly. Then we'll have some period of MSVC builds on Nightly at the end of a cycle where we can find any problems before we do our first Beta build.

Because of the implications, I want to be sure we're on the same page and understand the risks.

Is your request to have a "Nightly only" configuration with little MSVC coverage on Nightly/Beta? Or do you want us to revert to MSVC before uplift so there is some testing continuity for MSVC?
Flags: needinfo?(sledru)
https://hg.mozilla.org/mozilla-central/rev/01380eb6a6d0
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla63
So, with this change, all windows builds are using clang-cl. Which means we're now doing two opt windows builds with clang-cl per push, since both opt and pgo are (presumably) doing the same thing. Or are the pgo builds using an order file and the opt builds not?
AFAICT the order file bits are conditional on PGO. So Windows PGO is now no longer PGO, but clang + order file.

It might make sense to rename the builds. But depending how far down that rabbit hole we want to go, there could be side-effects (e.g. to tools looking for builds at specific TC index paths). It is definitely easier to keep them named "PGO" for now. We may want to sort everything out once the Clang transition is complete. Otherwise I fear we'll cause needless churn.
I was more concerned about opt and pgo potentially being the same thing, but if they're not, then that's less problematic.
Wow, this made it to a Nightly very quickly! 2017-07-10 (PM) is the first clang-cl Nightly.
> 2017-07-10 (PM) is the first clang-cl Nightly.

2018, of course :-)
Build times boosts on Windows:

== Change summary for alert #14277 (as of Wed, 11 Jul 2018 03:04:21 GMT) ==

Improvements:

 11%  build times windows2012-32 debug taskcluster-c4.4xlarge     2,345.17 -> 2,078.53
  9%  build times windows2012-64 debug taskcluster-c4.4xlarge     2,330.82 -> 2,114.59
  3%  build times windows2012-64 opt taskcluster-c4.4xlarge       2,175.09 -> 2,108.32
  3%  build times windows2012-32 opt taskcluster-c4.4xlarge       2,127.05 -> 2,067.04

For up to date results, see: https://treeherder.mozilla.org/perf.html#/alerts?id=14277
Big perf improvements according to Talos:

== Change summary for alert #14271 (as of Tue, 10 Jul 2018 17:22:52 GMT) ==

Improvements:

 15%  stylebench windows10-64-qr opt e10s stylo     37.44 -> 42.94
  8%  speedometer windows10-64-qr opt e10s stylo    60.44 -> 65.42
  5%  damp windows10-64 opt e10s stylo              96.59 -> 91.80

For up to date results, see: https://treeherder.mozilla.org/perf.html#/alerts?id=14271
(In reply to Ionuț Goldan [:igoldan], Performance Sheriffing from comment #36)
> Big perf improvements according to Talos:
>
>   5%  damp windows10-64 opt e10s stylo              96.59 -> 91.80

From DAMP perspective (Perf tests dedicated to DevTools), this change made "opt" builds between 10 to 25% faster depending on devtools feature we are testing. And it made PGO builds between 6 to 12% faster.
The new "opt" looks as fast as PGO-pre-clang. And the PGO-post-clang is slightly faster. Now, the difference between PGO and opt is smaller:
https://treeherder.mozilla.org/perf.html#/graphs?timerange=604800&series=mozilla-central,1655181,1,1&series=mozilla-central,1654655,1,1
An important point is that it improved every single of our tests and did not regressed a single one.
Depends on: 1474860
Note that the clang "pgo" builds currently in automation are only pseudo-pgo: the training step generates an order file (bug 1444171) to optimize the ordering of functions, but we don't actually optimize the function contents. Enabling proper PGO is tracked by bug 1341525.
Blocks: 1475384
Congrats on flipping the flag! Very happy to see this :-)

One question:

(In reply to David Major [:dmajor] from comment #8)
> The best contender I have been able to produce with clang-cl is: -O2
> everywhere, ThinLTO with LLDLTO=3, and an order file produced with the
> Chromium instrumentation. Here is the perf comparison of that build against
> MSVC PGO:
> https://treeherder.mozilla.org/perf.html#/
> compare?originalProject=try&originalRevision=6e33c533c935&newProject=try&newR
> evision=a18de4d58e27&framework=1&showOnlyConfident=1
> 
> 32-bit comes out a little bit ahead, but 64-bit still lags by generally
> 3-9%. It troubles me that we had to add LTO even to get here. Given Chrome's
> experience, I was hoping to get perf parity with merely clang-cl and
> link.exe, and that LTO would just be icing on the cake. It makes me wonder
> if we're doing something wrong in our build config.

...but then https://bugzilla.mozilla.org/show_bug.cgi?id=1474860#c3 measures pretty solid perf improvements instead (which is what I would've expected).

How does this fit together? Are treeherder and the perf tests over there measuring different things? Have you done any more compiler perf work that moved the metrics in the meantime?
(In reply to thakis from comment #39)
> How does this fit together? Are treeherder and the perf tests over there
> measuring different things? Have you done any more compiler perf work that
> moved the metrics in the meantime?

The only explanation I can come up with is that the trunk compiler got better. I had been doing periodic comparisons and one day a couple months ago the numbers started looking really competitive. It's an unsatisfying answer, I know. I probably could have bisected it, but as a practical matter I decided to drop everything and race to flip the switch before our test suite changes its mind.
Greg, antivirus are our biggest source of issue by far. 
I am concerned that not giving them enough time to update their software to deal with the new binary layout is going to impact us in a bad way. This is why I would like to take our time to make that change and ship 63 with MSVC.
As you suggested, we should make the switch back to MSVC right before the soft freeze.

I forwarded you the PI request from Anthony.
Flags: needinfo?(sledru)
Version: Version 3 → Trunk
(In reply to Sylvestre Ledru [:sylvestre] from comment #41)

Do we have reason to believe that delaying the release will make a difference? In other words, are the AV vendors actively cooperating with us on this? I don't want to end up in a vicious cycle where we restrict clang to Nightly and the AVs feel no priority to work on it because "it's only Nightly".
I replied in private to David. We have good reasons but not blocking reasons.

Having a spot check from QA asap on a bunch of AV with different versions would be a great start to understand the impact.
> to deal with the new binary layout

Wait what? the binary layout changes between every build with msvc. It changes more with the switch to clang-cl, but it's a matter of how much change there is, not whether there is change at all. So how can this matter at all?
Depends on: 1341525
Sorry Mike, I missed your answer. In the past, we had some issues with AV editors doing DLL injection or expecting some specific elements in our binaries. It caused Firefox to crash (usually on startup). I am just being extra careful (and it seems that from the preliminary testing, this isn't happening).
Release Note Request (optional, but appreciated)
[Why is this notable]: Important performance improvements, free toolchain
[Affects Firefox for Android]: No
[Suggested wording]: The build infrastructure of Firefox moved to the Clang tool chain, bringing important performance gains.
[Links (documentation, blog post, etc)]: We probably don't want to link to this blog post but Nathan did a great work explaining the advantages https://blog.mozilla.org/nfroyd/2018/05/29/when-implementation-monoculture-right-thing/
relnote-firefox: --- → ?
Depends on: 1476604
Depends on: 1483752
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: