Closed Bug 1026870 Opened 10 years ago Closed 9 years ago

Something wrong with windows build slaves ("LINK : fatal error LNK1123: failure during conversion to COFF: file invalid or corrupt")

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

All
Windows 7
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: glandium, Assigned: q)

References

Details

(Keywords: intermittent-failure, Whiteboard: [release-impacting])

See https://tbpl.mozilla.org/?tree=Try&jobname=win&rev=c6fe0466209b for example. It's a try push with sccache disabled (to rule it out), and it exhibits the same problem as other older builds on those slaves.

The error is:
LINK : fatal error LNK1123: failure during conversion to COFF: file invalid or corrupt 

usually when linking ICU, but it also happens when linking something from the crash reporter on some other builds.

Apparently, only the slaves numbered between 20 and 29 are affected. I haven't observed failures on all of them, because not all of them have built something recently, but all the slaves that do fail to build are in that range.

It started happening today or yesterday.

Related information about this error code:
http://msdn.microsoft.com/en-us/library/7dz62kfh.aspx
http://support.microsoft.com/kb/2757355
So these slaves do seem to still be using vs2010, as you can see from lines like /c/PROGRA~2/MICROS~2.0. On the slave, this short name still points to:
    /c/Program\ Files\ \(x86\)/Microsoft\ Visual\ Studio\ 10.0/

so it is not using the junction set up at /c/tools/vs2013/.
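
For reference, a quick way to confirm what an 8.3 short name actually expands to on a given slave is to ask the Win32 API directly. This is only a sketch, assuming Python is available on the machine; the example path is the one quoted above:

    import ctypes

    def expand_short_path(short_path):
        """Ask Windows which long path an 8.3 short name resolves to."""
        buf = ctypes.create_unicode_buffer(1024)
        ctypes.windll.kernel32.GetLongPathNameW(short_path, buf, 1024)
        return buf.value

    # Expected to print the VS2010 directory quoted above if the slave is
    # still pointing at vs2010.
    print(expand_short_path(u"C:\\PROGRA~2\\MICROS~2.0"))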

I am not sure why this is failing here. On my testing slave I was able to build this builder on m-c (not try) against vs2010 while vs2013 was also installed on the machine: http://people.mozilla.org/~jlund/vs2010-mozilla-central-winxp-opt.log
The fallout of burning these jobs falls on my head.

Regardless of why my test slave didn't hit this, it looks like this is a common issue. The resolution seems to be to install Visual Studio 2010 SP1, as mentioned here[1] and here[2].

pinging Q and dmajor for relops and dev perspective - is it possible to update our machines' VS2010 installs to SP1, or should we re-image these hosts and look at coming up with another alternative?

[1] - from comment 1 - http://support.microsoft.com/kb/2757355
[2] - http://stackoverflow.com/questions/10888391/error-link-fatal-error-lnk1123-failure-during-conversion-to-coff-file-inval
Flags: needinfo?(q)
Flags: needinfo?(dmajor)
From stack overflow: "Note that installing VS 2010 SP1 will remove the 64-bit compilers. You need to install the VS 2010 SP1 compiler pack to get them back."
You should check directly on one of those problematic slaves, and check the first microsoft link in comment #0 before considering upgrading MSVC.
I will need to follow up on this one with my team, as we applied the same GPO here as we did on the test slave. If SP1 does not fix this immediately we will re-image the machines, and then I think we should pull one and test against it.
Flags: needinfo?(q)
I'd rather you do it the other way around. Installing SP1 means changing the compiler. This can have all sorts of effects, like using more memory to do PGO linkage, or miscompiling, and that shouldn't be done lightly.
SP1 involves a change to the CRT for Firefox releases, and we should avoid that.
OK I've asked to revert the changes I requested: https://bugzilla.mozilla.org/show_bug.cgi?id=1019165#c12

re-imaging those machines sounds like best course of action for now.
Flags: needinfo?(dmajor)
Postmortem shows this:
 VS 2010 SP1 and the compiler pack were installed in conjunction with VS 2012 for initial testing on the test machine. We were planning on rolling it out with VS 2012 per releng instructions; then we (RelEng/RelOps) made the decision to skip 2012 and go to 2013, and that 2010 patch step was lost since the test machine already had it. To move forward we will need 2010 SP1 and the compiler pack reinstalled before going to 2013.
Just in case, I will get a 2010 SP1/compiler pack installer GPO ready, but I will need confirmation that we can run with 2010 SP1 on the builders.
The affected builders are being re-imaged now
(In reply to Q from comment #10)
> Just in case, I will get a 2010 SP1/compiler pack installer GPO ready, but I
> will need confirmation that we can run with 2010 SP1 on the builders.

See comment 7 and comment 8.
Thanks I missed comment 8
All slaves re-imaged and confirmed except 0023 which was slow due to a disk check being forced.
all slaves done.
Who is in charge of getting these back in?
Assignee: nobody → q
Flags: needinfo?(mh+mozilla)
That would be the lucky buildduty, aka jlund.
Component: General Automation → Buildduty
Flags: needinfo?(mh+mozilla)
QA Contact: catlee → bugspam.Callek
I didn't get around to this today. I will add them first thing in the morning. Win builders are far from our worst wait_times so this should be fine.
Also WRT actually installing vs2013, it looks like there are two options:

1) we have two separate pools of windows build machines. 1 for vs2010 and 1 for vs2013 where we slowly fade out the former.

2) we install VS2010 SP1 and VS2013 on all our build machines.

I would like to weigh out the pros/cons of both options in a discussion with folks more knowledgeable than myself.

Yes, the CRT will change, and we will have to measure performance on all our builders to check for diffs in things like PGO. This could be bad. But I'd like to discuss how bad, and whether it's worse than dividing up our win pools. Dividing up is not very optimal from Mozilla's release engineering side of things and will come with its own consequences.

Armen - I believe you have done some work on investigating vs2010 SP1 before, maybe with the vs2012 work Q mentioned here: https://bugzilla.mozilla.org/show_bug.cgi?id=1026870#c10 Do you have any input?
bhearsum - I heard rumors that you used to do all the 'windows' stuff before jhopkins and armen. any thoughts?

maybe a quick group meeting with bsmedberg and/or glandium to sort this out would be best?
Flags: needinfo?(bhearsum)
Flags: needinfo?(armenzg)
(In reply to Jordan Lund (:jlund) from comment #20)
> Also WRT actually installing vs2013, it looks like there are two options:
> 
> 1) we have two separate pools of windows build machines. 1 for vs2010 and 1
> for vs2013 where we slowly fade out the former.
> 
> 2) we install VS2010 SP1 and VS2013 on all our build machines.
> 
> I would like to weigh out the pros/cons of both options in a discussion with
> folks more knowledgeable than myself.
> 
> Yes, the CRT will change, and we will have to measure performance in all our
> builders to check for diffs in things like PGO. This could be bad. But I'd
> like to discuss how bad, and whether it's worse than dividing up our win pools.
> Dividing up is not very optimal from Mozilla's release engineering side of
> things and will come with its own consequences.

I'm extremely wary of changing anything in the toolchain for Beta, Release, and ESR. However, it *is* early in a Beta cycle, so we'd have lots of time to prove out the change with Beta users. This option would require RelMan sign off for sure because of the risk involved.

I haven't been following this bug until now - I'm assuming the CRT only changes for option #2. If so, can I ask why it must change? We already set PATH/LIB/etc. for our compiler - can we not pick up the old CRT by setting one of those? As far as I know, we've done all of our compiler upgrades without dividing the pool, so I'd like to understand what's special about this one. E.g.: bug 563318 was the tracker for 2010, and bug 563317 shows us installing it on existing machines.

The other downside to dividing up the pool is that we end up with worse machine utilization in both pools, which reduces our throughput for changes overall.
Flags: needinfo?(bhearsum)
I think dividing the pool is going to risk us missing all sorts of weirdness that we don't initially associate with a certain set of machines, and instead think is some sporadic intermittent issue.
How about option 3: look into http://msdn.microsoft.com/en-us/library/7dz62kfh.aspx and see if it's possible to *not* install SP1.
(In reply to Ben Hearsum [:bhearsum] from comment #21)
> As far as I know, we've done all of our compiler upgrades
> without dividing the pool, so I'd like to understand what's special about
> this one.

AIUI, VS2012/2013 installs a CVTRES.EXE that is not compatible with VS2010, and it looks like it puts it either in some shared location, overwriting VS2010's, or in a directory that takes precedence in $PATH. I hope it's the latter.

As to why that didn't happen when we switched to VS2010: in all likelihood, VS2010 wasn't installing a CVTRES.EXE that was incompatible with VS2005/VS2008.
As Ed and Ben mention, we should not divide pools unless we have to.

jlund, with regards to previous knowledge, I have only one input: order matters with Windows, especially when talking about Visual Studio. Installers and uninstallers can do things we would not expect them to.

Another tip: do not try to install in custom places unless you have to. I once had a variable that you could set for a specific path; however, not every component of VS would install there, as they had other secret environment variables.

Best of luck.

PS = Don't call Ben a Windows expert or he'll come at you :P
Flags: needinfo?(armenzg)
With only MSVC2013 installed, cvtres.exe is installed to c:\Program Files (x86)/Microsoft Visual Studio 12.0/VC/BIN/cvtres.exe

If cvtres.exe is indeed the problem, we should check and see which version is on the PATH currently, and which version ends up on the PATH in the "bad" case, and hopefully we can just munge the PATH to get the right thing on top in each case.
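
A minimal sketch of that check, assuming a stock Python on the slave (the helper below is illustrative, not an existing tool); it walks PATH in order, so the first hit is the copy a plain PATH search would pick up:

    import os

    def path_hits(exe="cvtres.exe"):
        """Every copy of `exe` found on PATH, in search order."""
        hits = []
        for d in os.environ.get("PATH", "").split(os.pathsep):
            candidate = os.path.join(d, exe)
            if os.path.isfile(candidate):
                hits.append(candidate)
        return hits

    for hit in path_hits():
        print(hit)
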
it sounds like bsmedberg and glandium's suggestion (option 3) is worth pursuing.

I've filed a bug to get another test machine, but with vs2010 (the non-SP1 version) and vs2013 installed on it: bug 1027745

I am closing this bug for now as this was a buildduty infra error bug which has been resolved. tracking getting the 10 re-imaged machines back into production will happen in the original rollout bug: Bug 1019165

overall goal of vs2013 in prod is still tracked here: Bug 1009807 - Figure out the correct path setup and mozconfigs for automation-driven MSVC2013 builds

ty all for your aid thus far.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #26)
> With only MSVC2013 installed, cvtres.exe is installed to c:\Program Files
> (x86)/Microsoft Visual Studio 12.0/VC/BIN/cvtres.exe
> 
> If cvtres.exe is indeed the problem, we should check and see which version
> is on the PATH currently, and which version ends up on the PATH in the "bad"
> case, and hopefully we can just munge the PATH to get the right thing on top
> in each case.

And even if cvtres.exe is overwritten by default, we could manually copy the older version somewhere and point PATH at it where needed, I think.
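
Roughly something like the following sketch; the stash directory is made up here, the VS2010 path assumes the usual VC\bin layout, and in practice the PATH change would live in the build's environment setup rather than be machine-wide:

    import os
    import shutil

    # Assumed locations; the real paths would be whatever relops picks.
    VS2010_CVTRES = r"C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\bin\cvtres.exe"
    STASH_DIR = r"C:\tools\cvtres-vs2010"

    if not os.path.isdir(STASH_DIR):
        os.makedirs(STASH_DIR)
    shutil.copy2(VS2010_CVTRES, STASH_DIR)

    # Prepend the stash so a plain PATH search finds the VS2010 copy first
    # (current process only; a deployment would set this per build).
    os.environ["PATH"] = STASH_DIR + os.pathsep + os.environ["PATH"]
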
Some potential workarounds: http://social.msdn.microsoft.com/Forums/vstudio/en-US/d10adba0-e082-494a-bb16-2bfc039faa80/vs2012-rc-installation-breaks-vs2010-c-projects?forum=vssetup

It seems the root cause is a .NET DLL dependency, which people have worked around by overwriting cvtres, getting a different one onto PATH, or tinkering with the machine's .NET installation. 

(I don't see a way to link to specific comments, but you can search for "Proposed as answer")
Tweaking summary to make the issue easier to find, particularly given we're unfortunately seeing it again now that bug 1019165 has started rolling out again.
Summary: Something wrong with b-2008-ix-002x slaves → Something wrong with b-2008-ix-002x slaves ("LINK : fatal error LNK1123: failure during conversion to COFF: file invalid or corrupt")
Adding keyword so TBPL sees this.
I filed bug 1049794 for b-2008-ix-0120 and disabled it in slavealloc.
This was due to a VS 2013 push to a machine with an active job. Markco, can you make sure these are disabled and jobs are DONE BEFORE deploying 2013?
Flags: needinfo?(mcornmesser)
Flags: needinfo?(arich)
Flags: needinfo?(mcornmesser)
Flags: needinfo?(arich)
reopening, as:
 a) 13 reports by sheriffs since closed
 b) lots of issues hit in bug 1057549 which point to very slow machines

(b) hit a number of release builds this cycle.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Whiteboard: [release-impacting]
Summary: Something wrong with b-2008-ix-002x slaves ("LINK : fatal error LNK1123: failure during conversion to COFF: file invalid or corrupt") → Something wrong with windows build slaves ("LINK : fatal error LNK1123: failure during conversion to COFF: file invalid or corrupt")
I have a suspicion: both this issue and bug 1057229 comment 5 could be explained by some builders picking up an incomplete VS2013 setup (i.e. without the cvtres fixup, or without Update 3). I don't know how that might happen, though.
Could the build slave become active before the GPO for VS2013 is fully applied? I assume it's a long install.
Depends on: 1057549
Blocks: 1062877
This is being addressed by https://bugzilla.mozilla.org/show_bug.cgi?id=1063372 and https://bugzilla.mozilla.org/show_bug.cgi?id=1063018
Status: REOPENED → RESOLVED
Closed: 10 years ago → 10 years ago
Resolution: --- → FIXED
We're still seeing these failures pretty regularly, especially on the release branches.
https://treeherder.mozilla.org/ui/logviewer.html#?job_id=72561&repo=mozilla-b2g32_v2_0
(In reply to Ryan VanderMeulen [:RyanVM UTC-5] from comment #54)
> We're still seeing these failures pretty regularly, especially on the
> release branches.
> https://treeherder.mozilla.org/ui/logviewer.html#?job_id=72561&repo=mozilla-
> b2g32_v2_0

ni: q: this issue just doesn't want to go away! :) the job Ryan is pointing to was against b-2008-ix-0005 @ Tue Dec 16 15:13:49 2014. is it possible this slave was recently imaged or had an incomplete gpo?
Flags: needinfo?(q)
Hmm, well, we are no longer applying 2013 from GPO, only in the base image, so in theory it can no longer get an incomplete install. I am doing a full audit to make sure that is 100% true.
Flags: needinfo?(q)
(In reply to Q from comment #56)
> Hmm, well, we are no longer applying 2013 from GPO, only in the base image, so
> in theory it can no longer get an incomplete install. I am doing a full
> audit to make sure that is 100% true.

hmm, 008, 001, and 005 (twice) have recently hit this. aside from this error, sheriffs are also seeing timeouts, essentially https://bugzil.la/1055876 again. Since this bug and 1055876 were related last time, I wonder if there is something similar at play, granted this isn't part of GPO anymore.

sheriffs are now being forced to disable slaves that hit this, to (a) stress the importance of this and (b) show the scope of which slaves are 'bad'.

Q, how did the audit go? would you be able to spend some brain cycles assisting me with this? I'm free most of this week. thanks for bearing with me :)
Flags: needinfo?(q)
Couldn't find anything obvious; the GPO isn't at play anywhere. I have set time aside tomorrow to slog through this one. Maybe we can do some Vidyo time after I do some more digging in the morning?
Flags: needinfo?(q)
Did these errors happen to start when we started installing the new HG version?
HG started rolling out on December 11th 2014
(In reply to Q from comment #63)
> HG started rolling out on December 11th 2014

TBH - I am not sure. Ryan made a comment on Dec 17th after a 3 month gap saying that we are still seeing this bug: https://bugzilla.mozilla.org/show_bug.cgi?id=1026870#c54

Ryan, is it possible that this only started acting up again around the 11th?
Status: RESOLVED → REOPENED
Flags: needinfo?(ryanvm)
Resolution: FIXED → ---
(In reply to Jordan Lund (:jlund) PTO till Jan 14th from comment #64)
> Ryan, is it possible that this only started acting up again around the 11th?

Could be, can't say I remember at this point.
Flags: needinfo?(ryanvm)
Unfortunately, it's looking a whole lot like "anything that has been reimaged in the last month or so."
Per discussions, it is possible that the HG installer including parts of the VS redistributable may be causing the problem. The fix may be correcting the system PATH or investigating the overlap. There are cases of machines failing that have not been re-imaged.
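
One way to investigate that overlap, as a sketch assuming Python on the slave: report every PATH directory that ships its own copy of the files in question. The list of suspect files below is a guess, not a confirmed set:

    import os

    # Guessed suspects: the COFF converter plus CRT DLLs that a bundled
    # redistributable might drop next to hg/git.
    SUSPECTS = ["cvtres.exe", "msvcr100.dll", "msvcr120.dll"]

    for d in os.environ.get("PATH", "").split(os.pathsep):
        found = [f for f in SUSPECTS if os.path.isfile(os.path.join(d, f))]
        if found:
            print("%s: %s" % (d, ", ".join(found)))
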
In bug 1117900 comment 20, Ryan found a builder with an old VS install. Relevant?
I was hoping it was; however, it doesn't seem likely, as we have found machines reporting the correct VS version that still have the issue. So far, the machines with the wrong version have been tracked to testing machines that did not get re-imaged.
Since it was brought up in IRC today: the three-month gap is concerning in that nothing we can find changed in the build or re-image process, and this seems to happen on machines that have and have not been re-imaged in the failure window. However, some GPO changes were made around the time the failures started, for the git and HG installs.
I’ve started reimaging the dependent machines in batches of 5, with an hour wait between batches.
(In reply to Chris Cooper [:coop] from comment #78)
> I’ve started reimaging the dependent machines in batches of 5, with an hour
> wait between batches.

These are all re-imaged and re-enabled now.
I think this might be fixed with the GPO updates discussed in IRC (I will transcribe them here when I get to a non-laptop keyboard). We haven't seen any new reports since 01/09. Is there a way to confirm the fix?
Flags: needinfo?(ryanvm)
Flags: needinfo?(coop)
I think that the lack of new slave disablings or reports in this bug since then is a positive sign :) Let's give it a week or so and resolve the bug if things look good?
Flags: needinfo?(ryanvm)
(In reply to Ryan VanderMeulen [:RyanVM UTC-5] from comment #88)
> I think that the lack of new slave disablings or reports in this bug since
> then is a positive sign :) Let's give it a week or so and resolve the bug if
> things look good?

Sounds good to me.
Flags: needinfo?(coop)
apologies, I was on PTO.

Thanks Q for implementing a likely fix.
Status: REOPENED → RESOLVED
Closed: 10 years ago → 9 years ago
Resolution: --- → FIXED
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard