x64/x86 SIMD: AVX2 support
Categories
(Core :: JavaScript: WebAssembly, enhancement, P3)
Tracking
firefox101: fixed
People
(Reporter: lth, Assigned: yury)
References
(Blocks 2 open bugs)
Details
Attachments
(7 files)
We should experiment with AVX2 support for SIMD on x86 and x64. Almost 60% of our Windows users have AVX2 (https://firefoxgraphics.github.io/telemetry/#view=system), and this can only increase. With AVX2 we would get lower register pressure from the 3-address ops, and better code generation in some cases.
My understanding is that plain AVX is probably not worth the bother, but I'll try to back that up with links to discussions.
We used to have AVX support for SIMD.js but we turned it off for two reasons:
- the AVX unit is frequently powered down and powering it up is expensive. It is only code that uses Serious SIMD that benefits from getting AVX codegen.
- the YMM registers caused some weird stalls at task switch time on MacOS.
The parameters may have changed here: there may have been hypothetical fixes to macOS, the Mac is moving to ARM64, and the cost of enabling/disabling AVX may be lower in current chips or OSes. In addition, we may be able to stick to SSE4.1 in the baseline compiler to avoid stalls during startup. There must be other ideas.
I will try to dig up old bugs that pertain to the known problems so that we can investigate how to avoid them.
One thing I don't know yet is whether there's a penalty for mixing AVX code (instructions have a special prefix) with non-AVX code, so that it would be necessary to do AVX encodings for everything before we could assess performance.
Reporter
Comment 1 • 4 years ago
https://github.com/WebAssembly/simd/issues/342#issuecomment-834805766 suggests (this has to be checked) that AVX-encoded instructions no longer require memory operands to be aligned. At the moment, for example, we can't emit PADDD offs(basereg), destreg because we don't normally know whether the effective address is aligned. Being able to do so without worrying about alignment would save us a load, and save us from dedicating a register to the loaded value.
Comment 2 • 4 years ago
The alignment requirements are documented in the Intel Manual, Vol. I, Chapter 14, PROGRAMMING WITH AVX, FMA AND AVX2.
In short, only the explicitly aligned AVX instructions trigger a #GP fault; "regular" AVX instructions won't.
14.9 MEMORY ALIGNMENT
[...]
With the exception of explicitly aligned 16 or 32 byte SIMD load/store instructions, most VEX-encoded,
arithmetic and data processing instructions operate in a flexible environment regarding memory address
alignment, i.e. VEX-encoded instruction with 32-byte or 16-byte load semantics will support unaligned load
operation by default. Memory arguments for most instructions with VEX prefix operate normally without
causing #GP(0) on any byte-granularity alignment (unlike Legacy SSE instructions). The instructions that
require explicit memory alignment requirements are listed in Table 14-22
[...]
Assignee
Comment 3 • 3 years ago
Assignee
Comment 4 • 3 years ago
Finding interesting behavior for (func (param v128 v128) (result v128) local.get 0 local.get 1 i32x4.add):
with AVX:
00000024 66 0f 6f d8 movdqa %xmm0, %xmm3
00000028 66 0f 6f d3 movdqa %xmm3, %xmm2
0000002C c5 e9 fe c1 vpaddd %xmm1, %xmm2, %xmm0
without AVX:
00000024 66 0f fe c1 paddd %xmm1, %xmm0
Is it a regalloc issue?
Reporter
Comment 5 • 3 years ago
That looks like a common regalloc problem, yes. In this case, the regalloc will choose to move xmm0 to xmm2 since xmm0 is used both for the result and for one operand that has to be live throughout ("useRegister"); this is a conflict.
The extra move via xmm3 is a regalloc bug that I see quite often, though it does not look like there's a bug on file for it. I think this is a missing optimization: multiple move groups are not sensibly merged when they are separated by an LParameter node, but this is just a guess.
If you introduce a dummy parameter 0 and then use parameters 1 and 2 as the inputs to the add, you'll likely see better code. You'll see this hack used in some of the whitebox tests for code generation, to avoid this specific problem.
In principle, if both the lhs and rhs are marked as useRegisterAtStart instead of useRegister then the regalloc should be able to generate good code here since xmm0 can be reused. But you have to be careful about how you do this, see lowerForFPU for some sample code.
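A rough pseudocode sketch of that lowering idea (useRegisterAtStart, defineReuseInput, and lowerForFPU are real SpiderMonkey names; the rest is illustrative, not the actual implementation):

```cpp
// Pseudocode sketch, not the real SpiderMonkey code. With both inputs
// marked useRegisterAtStart, the allocator knows the input registers
// are only needed at the start of the instruction and may be reused
// for the output, so xmm0 can serve as both operand and result with
// no extra moves.
void lowerSimdBinary(MSimdBinary* ins) {
    LAllocation lhs = useRegisterAtStart(ins->lhs());
    LAllocation rhs = useRegisterAtStart(ins->rhs());
    // Output reuses the lhs register (operand 0), as lowerForFPU
    // does for two-address SSE instructions.
    defineReuseInput(new LSimdBinary(lhs, rhs), ins, 0);
}
```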
Assignee
Comment 6 • 3 years ago
If you introduce a dummy parameter 0 and then use parameters 1 and 2 for the inputs to the add you'll likely see better code. You see this hack used in some of the whitebox tests for code generation sometimes, to avoid this specific problem.
Correct, avoiding parameter 0 bypasses the problem. The problem appears only for simple functions where the result register is one of the used parameter registers (e.g. it can be param 1 if param 0 has type 'i32').
Updated • 3 years ago
Assignee
Comment 7 • 3 years ago
Locally, if I add test-also=--enable-avx to the directives.txt, all wasm/simd/ tests pass. What would be the right way to subject AVX to all SIMD tests?
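For reference, the kind of directives-file change being described might look like the fragment below; the |jit-test| line format is an assumption about the harness, and only the test-also=--enable-avx part is taken from the comment.

```
|jit-test| test-also=--enable-avx
```

With test-also, the jit-test harness runs each test in the directory both with and without the extra flag.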
Reporter
Comment 8 • 3 years ago
(In reply to Yury Delendik (:yury) from comment #7)
Locally, if I add test-also=--enable-avx to the directives.txt, all wasm/simd/ tests pass. What would be the right way to subject AVX to all SIMD tests?
What happens if you flip the default value of avxEnabled to true and run all jit-tests and jstests? I think that's the only test that really matters. (Maybe we should do a phone call about this.)
Assignee
Updated • 3 years ago
Assignee
Comment 10 • 3 years ago
I forced enabling AVX on the try server: https://treeherder.mozilla.org/jobs?repo=try&revision=8ae9a3914f6303d1d511a7784ba1e08de56b80e3&selectedTaskRun=S52NoQcxTBGINHjARFoFMQ.0
Locally and on the try server, the following jit-tests (the codegen tests) fail:
- wasm/simd/bitselect-x64-ion-codegen.js
- wasm/simd/cvt-x64-ion-codegen.js
- wasm/simd/ion-analysis.js
- wasm/simd/ion-bug1688713.js
- wasm/simd/shuffle-x86-ion-codegen.js
- wasm/simd/splat-x64-ion-codegen.js
Comment 11 • 3 years ago
bugherder
Assignee
Comment 12 • 3 years ago
Comment 13 • 3 years ago
Comment 14 • 3 years ago
bugherder
Assignee
Updated • 3 years ago
Reporter
Comment 15 • 3 years ago
This seems relevant: https://bytecodealliance.zulipchat.com/#narrow/stream/217117-cranelift/topic/x64.20SIMD.20alignment; see especially the later comments with optimization advice regarding vzeroupper and the discussion about mixing SSE and AVX encodings. I think the conclusion is "thou shalt benchmark carefully", but it is worth digging into the Intel docs or searching for related information.
Assignee
Comment 16 • 3 years ago
Agreed. vzeroupper comes into play when 256-bit / full-YMM registers are used; if nothing executes a 256-bit instruction, then we are okay.
Though it looks like we have to inspect the entire Firefox codebase and somehow guarantee a "Clean Upper State" during execution of SpiderMonkey JIT code regardless of this bug -- nowadays the dependencies might use AVX2, e.g. codecs or (bergamot) intrinsics. From what I have seen so far, we don't use YMM in the JIT itself.
(Per https://cdrdv2.intel.com/v1/dl/getContent/671488?explicitVersion=true&wapkw=intel%2064%20and%20ia-32%20architectures%20optimization%20reference%20manual , Section 15.3, "Mixing AVX Code with SSE Code")
Assignee
Comment 17 • 3 years ago
Assignee
Comment 18 • 3 years ago
Allows AVX SIMD instructions on x86/x64, mostly as an experiment for benchmarking -- if successful, it will be enabled whenever available.
Depends on D135560
Comment 19 • 3 years ago
Comment 20 • 3 years ago
bugherder
Comment 21 • 3 years ago
Comment 22 • 3 years ago
bugherder
Assignee
Comment 23 • 3 years ago
Adds CPUID detection that sets avxPresent.
The isAvxPresent function interface is modified to allow checking whether AVX2 is active.
Comment 24 • 3 years ago
Comment 25 • 3 years ago
bugherder
Assignee
Updated • 3 years ago
Assignee
Comment 26 • 3 years ago
Comment 27 • 3 years ago
Comment 28 • 3 years ago
bugherder
Assignee
Comment 29 • 3 years ago
The x64 architectures optimization manual, Section 15.3, "Mixing AVX Code with SSE Code", describes runtime penalties when 256-bit and 128-bit AVX operations are mixed. After checking our code, inserting a ymm-upper-parts-are-dirty check, and running the test suites, it looks like we either don't use 256-bit registers or we properly return to the Clean (VZEROUPPER) state.
Compilers (GCC/Clang) normally insert VZEROUPPER instructions on the boundaries between YMM and XMM usage unless explicitly instructed otherwise, so libraries such as bergamot/intgemm are not in danger of breaking the clean-state invariant.
Assignee
Comment 30 • 3 years ago
Assignee
Comment 31 • 3 years ago
Support for AVX is implemented, and the majority of masm operations were optimized to support VEX encoding. Special-case AVX2 variants were also added where available.
Comment 32 • 3 years ago
Comment 33 • 3 years ago
bugherder