Closed Bug 1149377 Opened 9 years ago Closed 9 years ago

Windows PGO tests are mostly permafail after uplift to aurora

Categories

(Core :: JavaScript Engine, defect)

39 Branch
x86
Windows XP
defect
Not set
blocker

Tracking

()

RESOLVED FIXED
mozilla40
Tracking Status
firefox39 + fixed
firefox40 --- fixed

People

(Reporter: KWierso, Assigned: nbp)

References

Details

(Keywords: crash)

Crash Data

Attachments

(6 files)

Interestingly, the Windows 8 PGO tests are green, it's just the 32-bit XP and 7 that are failing.
A few notes:
1.) All Aurora Windows opt builds are PGO. So we don't know definitively yet if it's truly PGO-only or not.
2.) This didn't show up on my last set of uplift simulation Try runs from last Friday (and PGO is enabled on those), which gives a rough regression range of https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=44e454b5e93b&tochange=1b6bf6612c0f
3.) I have a set of non-PGO Try pushes going from over the range in #2 to see if it reproduces w/o PGO. Hopefully that answer is yes as it'll make bisecting way easier.
(In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #2)
> 3.) I have a set of non-PGO Try pushes going from over the range in #2 to
> see if it reproduces w/o PGO. Hopefully that answer is yes as it'll make
> bisecting way easier.

Looking green. Sigh :(
[Tracking Requested - why for this release]: Last I checked, we like having test coverage for the primary platform we ship.

PGO runs triggered across the range from comment 2.
See Also: → 1034706
More reduced regression range:
https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=e16dc56f90d5&tochange=02f2f4c75007
Component: Buildduty → JavaScript Engine
Product: Release Engineering → Core
QA Contact: bugspam.Callek
Severity: normal → blocker
Thanks Ryan, let me know if I can help. Getting a smaller regression range is probably the best way forward right now so I'll wait for that.
Tracking as it is critical.
Naveed, seems like we need your help! Could you help us with that? Thanks (this is critical as we cannot reenable 39 aurora updates).
Flags: needinfo?(nihsanullah)
Crash Signature: [@ js::jit::EnterBaselineMethod(JSContext*, js::RunState&)]
Keywords: crash
OS: Windows 8.1 → Windows XP
Hardware: x86_64 → x86
Version: unspecified → 39 Branch
Would it be possible for you to test with each of the following preferences:

1/ pref("javascript.option.asmjs", false);

2/ pref("javascript.option.baselinejit", false);

3/ pref("javascript.option.ion", false);

4/ pref("javascript.options.ion.offthread_compilation", false);

5/ pref("javascript.option.asmjs", false);
   pref("javascript.option.baselinejit", false);
   pref("javascript.option.ion", false);

And tell us if this still reproduce the PGO issues?
Flags: needinfo?(petruta.rasa)
Attached patch Aurora backoutSplinter Review
Here's a patch to backout b2904e8f07e7, 	509282768033 and 8787eda5c93e from Aurora across the big style change.

The last one can be backed out without conflicts, so I'm not posting a separate patch for it.
Still burns with just bug 1145811 backed out. Looks like we'll need to take both.
https://treeherder.mozilla.org/#/jobs?repo=try&revision=f1efb1501029
(In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #13)
> https://hg.mozilla.org/releases/mozilla-aurora/rev/759eb70deb31

And we're confirmed green on Aurora. Leaving status-firefox39 set to affected under the assumption that we eventually want to revert the backout in comment 13 in favor of a more targeted fix.
Assignee: nobody → nicolas.b.pierron
(In reply to Nicolas B. Pierron [:nbp] from comment #10)
> Would it be possible for you to test with each of the following preferences:
> 
> 1/ pref("javascript.option.asmjs", false);
> 
> 2/ pref("javascript.option.baselinejit", false);
> 
> 3/ pref("javascript.option.ion", false);
> 
> 4/ pref("javascript.options.ion.offthread_compilation", false);
> 
> 5/ pref("javascript.option.asmjs", false);
>    pref("javascript.option.baselinejit", false);
>    pref("javascript.option.ion", false);
> 
> And tell us if this still reproduce the PGO issues?

The crash reproduces with every one of the above preferences.
Flags: needinfo?(petruta.rasa)
I think this might take me multiple days to isolate the minimal set of functions which are causing this issue.  Still I am a bit worried with the previous message, knowing that these 3 patches are JIT only modifications, as it implies that even if we disable all the JITs (5) we might still get this issue.
Ok, as promise, it seems that the PGO issue is located in BaselineIC.cpp.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=07460ab2c07b
https://treeherder.mozilla.org/#/jobs?repo=try&revision=a1a5abf479a5

I will see if I can precisely isolate the function is responsible for the PGO failure.
I cancelled at least the following (I stopped copy/pasting at around 50, I probably cancelled more and there were previous ones already cancelled):

https://treeherder.mozilla.org/#/jobs?repo=try&revision=70b70c4ee911
https://treeherder.mozilla.org/#/jobs?repo=try&revision=dac5a70a9567
https://treeherder.mozilla.org/#/jobs?repo=try&revision=d511200fa0de
https://treeherder.mozilla.org/#/jobs?repo=try&revision=6a30059c9579
https://treeherder.mozilla.org/#/jobs?repo=try&revision=623291af66a3
https://treeherder.mozilla.org/#/jobs?repo=try&revision=cba2df129cab
https://treeherder.mozilla.org/#/jobs?repo=try&revision=fce5456fdb26
https://treeherder.mozilla.org/#/jobs?repo=try&revision=059e5739abc9
https://treeherder.mozilla.org/#/jobs?repo=try&revision=e6557fc2630c
https://treeherder.mozilla.org/#/jobs?repo=try&revision=e75996fb8aed
https://treeherder.mozilla.org/#/jobs?repo=try&revision=8ce77f561d22
https://treeherder.mozilla.org/#/jobs?repo=try&revision=53af553ac122
https://treeherder.mozilla.org/#/jobs?repo=try&revision=5d91c7b3e0fd
https://treeherder.mozilla.org/#/jobs?repo=try&revision=5b91239bb969
https://treeherder.mozilla.org/#/jobs?repo=try&revision=f478a2b52777
https://treeherder.mozilla.org/#/jobs?repo=try&revision=30c279269b8f
https://treeherder.mozilla.org/#/jobs?repo=try&revision=aef9c1c09dcb
https://treeherder.mozilla.org/#/jobs?repo=try&revision=350ce8b6df8c
https://treeherder.mozilla.org/#/jobs?repo=try&revision=443504efd62e
https://treeherder.mozilla.org/#/jobs?repo=try&revision=db2e6e65474c
https://treeherder.mozilla.org/#/jobs?repo=try&revision=8af732e51c39
https://treeherder.mozilla.org/#/jobs?repo=try&revision=4a5d4a3d0370
https://treeherder.mozilla.org/#/jobs?repo=try&revision=ee355d06627e
https://treeherder.mozilla.org/#/jobs?repo=try&revision=e15b4ad076d1
https://treeherder.mozilla.org/#/jobs?repo=try&revision=5d26e2ae2093
https://treeherder.mozilla.org/#/jobs?repo=try&revision=6bda981516e4
https://treeherder.mozilla.org/#/jobs?repo=try&revision=9756fd2cf2c9
https://treeherder.mozilla.org/#/jobs?repo=try&revision=76ac8e2985c6
https://treeherder.mozilla.org/#/jobs?repo=try&revision=74d32470d432
https://treeherder.mozilla.org/#/jobs?repo=try&revision=80bd2572042e
https://treeherder.mozilla.org/#/jobs?repo=try&revision=c95799805d5b
https://treeherder.mozilla.org/#/jobs?repo=try&revision=ffdc74502c33
https://treeherder.mozilla.org/#/jobs?repo=try&revision=cd3729e1e206
https://treeherder.mozilla.org/#/jobs?repo=try&revision=6b806684758e
https://treeherder.mozilla.org/#/jobs?repo=try&revision=19beaf235dc1
https://treeherder.mozilla.org/#/jobs?repo=try&revision=f096c420f75e
https://treeherder.mozilla.org/#/jobs?repo=try&revision=b42dd0f558e8
https://treeherder.mozilla.org/#/jobs?repo=try&revision=83b759f5003f
https://treeherder.mozilla.org/#/jobs?repo=try&revision=f32c05945d26
https://treeherder.mozilla.org/#/jobs?repo=try&revision=53fc8bca153c
https://treeherder.mozilla.org/#/jobs?repo=try&revision=f8e5c7a8e31e
https://treeherder.mozilla.org/#/jobs?repo=try&revision=84686b61587c
https://treeherder.mozilla.org/#/jobs?repo=try&revision=0dd5bf384424
https://treeherder.mozilla.org/#/jobs?repo=try&revision=c69bb6792435
https://treeherder.mozilla.org/#/jobs?repo=try&revision=6d268404d428
https://treeherder.mozilla.org/#/jobs?repo=try&revision=41ed8454c31f
https://treeherder.mozilla.org/#/jobs?repo=try&revision=9f62aed878fc
https://treeherder.mozilla.org/#/jobs?repo=try&revision=b500a0eae497
https://treeherder.mozilla.org/#/jobs?repo=try&revision=adc46fd1626d

I'm sorry I had to do this, because this will likely delay investigating this bug, but you can't push 400 hours of work for the same platform to try at once without impacting everyone else.
FWIW, I retriggered a few, now that the number of pending try builds on windows is at a decent level.
Just thought of this: if that can be reproduced by disabling pgo for *everything* but the file you're bisecting in, builds would finish much faster.
(In reply to Mike Hommey [:glandium] from comment #18)
> I cancelled at least the following (I stopped copy/pasting at around 50, I
> probably cancelled more and there were previous ones already cancelled):

This are not copy & paste, I made patches to disable one function at a time.

The previously canceled one were going to fail anyway as the first patch was malformed and caused compilation errors.
 
(In reply to Mike Hommey [:glandium] from comment #20)
> Just thought of this: if that can be reproduced by disabling pgo for
> *everything* but the file you're bisecting in, builds would finish much
> faster.

Do we have documentation to explain how to do this, or you are suggesting to annotate every other file with the pragma?  How fast would it build, as PGO still implies compiling twice and running a set of benchmark to collect PGO data, I doubt this will save much over 400h.

dbaron suggested on IRC that we could reduce the priority of the job, is there an easy way to do that?  I would prefer to wait for result than waiting for making pushes, knowing that I am not awake 24h/24 (as opposed to some believe).
(In reply to Nicolas B. Pierron [:nbp] from comment #21)
> (In reply to Mike Hommey [:glandium] from comment #18)
> > I cancelled at least the following (I stopped copy/pasting at around 50, I
> > probably cancelled more and there were previous ones already cancelled):
> 
> This are not copy & paste, I made patches to disable one function at a time.

I meant I stopped copy/pasting treeherder urls at around 50, and I'm sure I cancelled more than that.

> The previously canceled one were going to fail anyway as the first patch was
> malformed and caused compilation errors.
>  
> (In reply to Mike Hommey [:glandium] from comment #20)
> > Just thought of this: if that can be reproduced by disabling pgo for
> > *everything* but the file you're bisecting in, builds would finish much
> > faster.
> 
> Do we have documentation to explain how to do this, or you are suggesting to
> annotate every other file with the pragma?

One documented way is to add NO_PGO = True to every moz.build file, and set SOURCES['foo.cpp'].no_pgo = True for all files in js/src. That's a lot of work.
If I were you, I'd just modify config/config.mk to never add PROFILE_GEN_CFLAGS to the cflags, and then add SOURCES['foo.cpp'].flags += ['-GL'] for the one file you want to PGO.

> How fast would it build, as PGO still implies compiling twice and running a
> set of benchmark to collect PGO data, I doubt this will save much over 400h.

Most of the time during a PGO build is spent during linkage, where the linker is actually compiling with the PGO optimizations. Everything .obj is not compiled with -GL is *not* compiled during linkage but before, which makes linkage *much* faster. That could save more than half the amount of time.

> dbaron suggested on IRC that we could reduce the priority of the job, is
> there an easy way to do that?  I would prefer to wait for result than
> waiting for making pushes, knowing that I am not awake 24h/24 (as opposed to
> some believe).

Well, that wouldn't help much actually, unless there are many pending jobs already when you push, because the only way to reduce the priority of a job is for it to be pending already. And if the number of pending jobs is low, well, if you do 50 pushes, *you* end up filling the queue... Except that thanks to the spidermonkey jobs also being automatigally triggered, and there not being a way for them not to be, when you do 50 pushes, you push 150 jobs, and there are only 57 build slaves for windows. Thankfully, the SM jobs are not taking 4 hours like the others... So, no, our automation is not very helpful at the moment for things like what you're trying to do. The best you could do is either find volunteers who would push for you over 24 hours, or automate on your end, doing only a few pushes every couple hours or so.
I'd like this to be marked as fixed for 39, and maybe open this in a new bug. 
That way I can be clear that we aren't blocked on enabling updates for Aurora tomorrow.
Flags: needinfo?(ryanvm)
Whatever floats your boat.
Flags: needinfo?(ryanvm)
How about disabling say 10 functions at a time? Would that be an acceptable load for try?

And has anyone tried actually debugging the crash? Or do you expect that the effect is too far removed from the cause?
(In reply to :dmajor (semi-away, use needinfo) from comment #25)
> How about disabling say 10 functions at a time? Would that be an acceptable
> load for try?

I am currently tri-secting, which should give me a linear set of function by Saturday (instead of 12h ago).

Then I will cross-fingers and double check if the issue is only located in one unique function, and not multiple.  Which means that I should have an answer by Sunday if there is only one function responsible.

> And has anyone tried actually debugging the crash? Or do you expect that the
> effect is too far removed from the cause?

No, and as the error is in the generated code, getting back to the function which produced it is not something trivial as it might depend on the input of the function.  The failure is likely caused by a bad register allocation in the generated code, but the bad register allocation will not tell us how badly PGO behaves.

The current method will give us an answer while we can still continue to work on something else.  Even if I would prefer to have the answer ASAP, doing this kind of investigation manually is not worth our time unless we have other choices.
So I tried an experiment. RyanVM triggered PGO Try builds from the same base changeset, one that was built as Nightly (http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/ryanvm@gmail.com-0faff334312d/try-win32/ ) and one that was built as Aurora (http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/ryanvm@gmail.com-04486b1fe169/try-win32/ ). nbp said that he knew the problem was in BaselineIC.cpp, so I wrote a script to parse the Breakpad symbol file, get all functions that have code from that file, and use dumpbin to spit out the disassembly. Then I ran it on both binaries and diffed the results. The script tweaks the disassembly output a bit to smooth over differences that are just from addresses shifting. You can find the script and the outputs here:
https://github.com/luser/diff-compiled-binaries

The diff is still a bit hard to read, I'm amazed at how many fiddly differences there are in the codegen. I'm not sure if this is helpful, I was hoping something would stand out but I haven't seen anything obvious. Maybe someone more familiar with the code can take a look.

If it'd be useful I could also hack the script to do source interleaved with the assembly.
So far I know that there is "at least" one issue in the following functions:

ICCallStubCompiler::pushSpreadCallArguments
ICCallStubCompiler::guardFunApply
ICCallStubCompiler::pushCallerArguments
ICCallStubCompiler::pushArrayArguments
ICCall_Fallback::Compiler::generateStubCode
ICCallScriptedCompiler::generateStubCode
ICCall_StringSplit::Compiler::generateStubCode
ICCall_IsSuspendedStarGenerator::Compiler::generateStubCode
pushCallerArguments looked suspicious, but it just seems that a function is inlined in one case and it is not in the other case.

So the problem is not coming from pushCallerArguments.
Ok, the last try pushes came back.
There is "at least" one issue in the following functions:

ICCallStubCompiler::guardFunApply
ICCallStubCompiler::pushCallerArguments
ICCallStubCompiler::pushArrayArguments

As I mentioned in comment 29 pushCallerArguments does not differ much between nightly/aurora except for the inlining of one function.  Identically pushArrayArguments does something similar, which looks like the same inlined functions (branchPtr).

So, I suspect that there is "at least" one issue in ICCallStubCompiler::guardFunApply. At first sight the aurora version of the function is no longer using a frame pointer, which will make this reverse engineering a bit more challenging.

I submitted another try push to see if this is the only issue, and if I is, I suggests that we disable PGO for this function in order to revert the backout.
I tried my best to reverse engineer what MSVC produced for guardFunApply on aurora. (with a lot of WTF in the process)

The result is scary.  Apparently, MSVC miss-compiled 2 calls to the same function branchTestObjClass in guardFunApply.  In both cases the first argument of the function, which is supposed to be a push of a constant, is replaced by 'push ecx', in the first case it happen to be defined to a value which is between 0-8, and in the second case it is unknown as exc is a volatile register.

> xx xx xx xx xx     call         ...
> 8B 44 24 34        mov         eax,dword ptr [esp+34h]
> 0F B6 C0           movzx       eax,al
> 0F BC C0           bsf         eax,eax
> 55                 push        ebp
> 68 F0 02 B7 11     push        11B702F0h
> 89 44 24 4C        mov         dword ptr [esp+4Ch],eax
> 0F B6 C0           movzx       eax,al
> 50                 push        eax
> 56                 push        esi
> 51                 push        ecx                     ; --volatile-- /!\
> 8B CB              mov         ecx,ebx
> xx xx xx xx xx     call         ...

This first argument of branchTestObjClass is not verified and is directly used to encode the opcode of a jump instruction[1].  This is more than likely to cause a JIT code issue at runtime.  The first call to branchTestObjClass might have produced a bad jump condition, but in the second case, the jump might even be replaced by anything.

Hopefully, removing PGO on this specific function seems to get rid of this issue [2], but what caused the the miss-guided optimization to "prefer" branchTestObjClass function.  Maybe we should disable PGO on all functions which are using it.

I have no idea what features of my patch enabled this issue.  One important fact is that guardFunApply is using a frame pointer on nightly, and it is using only the stack pointer on aurora.

[1] https://dxr.mozilla.org/mozilla-central/source/js/src/jit/x86-shared/Encoding-x86-shared.h#261-272
[2] https://treeherder.mozilla.org/#/jobs?repo=try&revision=abd51ceb62be
Comment on attachment 8589015 [details] [diff] [review]
Disable PGO on ICCallStubCompiler::guardFunApply.

Review of attachment 8589015 [details] [diff] [review]:
-----------------------------------------------------------------

Good find!

::: js/src/jit/BaselineIC.cpp
@@ +9805,5 @@
>      masm.pushValue(Address(BaselineFrameReg, STUB_FRAME_SIZE + 1 * sizeof(Value)));
>      masm.pushValue(Address(BaselineFrameReg, STUB_FRAME_SIZE + 2 * sizeof(Value)));
>  }
>  
> +#if defined(_MSC_VER)

Nit: I'd add a comment before this line:

// MSVC 2013 with PGO miscompiles this function on Aurora.

@@ +9923,5 @@
>      }
>      return target;
>  }
>  
> +#if defined(_MSC_VER)

Nit: can you move the blank line before this #if after the #endif? Makes it less likely new functions are added inside the no-PGO code instead of outside it.
Attachment #8589015 - Flags: review?(jdemooij) → review+
(In reply to Jan de Mooij [:jandem] from comment #33)
> Nit: I'd add a comment before this line:
> 
> // MSVC 2013 with PGO miscompiles this function on Aurora.

// MSVC 2013 with PGO miscompiles this function without frame pointers.

Also, it's be better to make the test *actually* MSVC 2013 only, so that it doesn't stay in forever without being reevaluated when we upgrade.
Ryan - I may have misunderstood here. I thought this was fixed for 39 but on second look it doesn't seem to be now that I've read bug 1143011 and read this over again.
It's "fixed" in that the regressing patch got backed out from Aurora. It's "affected" in that the patch will eventually need to be re-landed with the fix from this bug included. So track that however you will :)
Flags: needinfo?(nihsanullah)
Attached file disable-pgo.sh
(script that I used) This script make a git commit after disabling PGO on a file.
Attached file disable-pgo-perfile.sh
This script disable PGO on all sections of code which got modified in a patch.

This script use "git diff -W" output to find the beginning and the end of modified function, by looking at the patch locations within the original file.  Then for each of these functions, it add the needed annotations to disable PGO.
https://hg.mozilla.org/mozilla-central/rev/2267c3b58a1c
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla40
Comment on attachment 8589015 [details] [diff] [review]
Disable PGO on ICCallStubCompiler::guardFunApply.

Approval Request Comment
[Feature/regressing bug #]: Bug 1149377 (/ Bug 1143011 ?)
[User impact if declined]: A lot of false positive fuzzing issue Bug 1143011.
[Describe test coverage new/current, TreeHerder]: Tested on Try with Aurora configuration, and landed on inbound.
[Risks and why]: Disable MSVC 2013 PGO on one function which is being miss-compiled.
[String/UUID change made/needed]: N/A

Note, we should also revert the backout of Bug 1143011 at the same time.
Attachment #8589015 - Flags: approval-mozilla-aurora?
Comment on attachment 8589015 [details] [diff] [review]
Disable PGO on ICCallStubCompiler::guardFunApply.

Approved for uplift to aurora.
Attachment #8589015 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Depends on: 1289987
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: