Closed Bug 937279 Opened 11 years ago Closed 7 years ago

w864 slaves fail cloning and rebooting; get into infinite retry loops

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

Platform: x86_64
OS: Windows 8
Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED INCOMPLETE

People

(Reporter: bhearsum, Unassigned)

References

Details

(Whiteboard: SOLUTIONS: modify reboot steps on ScriptFactory AND add auto retry logic)

OS: Linux → Windows 8
It seems that all versions of Windows share the same hg version. [1]
Same VLANs. [2]

Could it be some sort of weird interaction between hgtool.py and hg?
It would be great to have logs to inspect.

[1]
cltbld@T-W732-IX-001 ~
$ /c/mozilla-build/hg/hg.exe --version
Mercurial Distributed SCM (version 2.5.4)

cltbld@T-W864-IX-001 ~
$ /c/mozilla-build/hg/hg.exe --version
Mercurial Distributed SCM (version 2.5.4)

$ /c/mozilla-build/hg/hg.exe --version
Mercurial Distributed SCM (version 2.5.4)

[2]
armenzg@UbuntVM:~/moz/patches$ host t-w732-ix-001
t-w732-ix-001.build.mozilla.org is an alias for t-w732-ix-001.wintest.releng.scl3.mozilla.com.
t-w732-ix-001.wintest.releng.scl3.mozilla.com has address 10.26.40.151
armenzg@UbuntVM:~/moz/patches$ host t-w864-ix-001
t-w864-ix-001.build.mozilla.org is an alias for t-w864-ix-001.wintest.releng.scl3.mozilla.com.
t-w864-ix-001.wintest.releng.scl3.mozilla.com has address 10.26.40.31
armenzg@UbuntVM:~/moz/patches$ host t-xp32-ix-001
t-xp32-ix-001.build.mozilla.org is an alias for t-xp32-ix-001.wintest.releng.scl3.mozilla.com.
t-xp32-ix-001.wintest.releng.scl3.mozilla.com has address 10.26.41.21
armenzg@UbuntVM:~/moz/patches$ host t-xp32-ix-109
t-xp32-ix-109.build.mozilla.org is an alias for t-xp32-ix-109.wintest.releng.scl3.mozilla.com.
t-xp32-ix-109.wintest.releng.scl3.mozilla.com has address 10.26.41.129
armenzg@UbuntVM:~/moz/patches$ host t-xp32-ix-075
t-xp32-ix-075.build.mozilla.org is an alias for t-xp32-ix-075.wintest.releng.scl3.mozilla.com.
t-xp32-ix-075.wintest.releng.scl3.mozilla.com has address 10.26.41.95
armenzg@UbuntVM:~/moz/patches$ host t-xp32-ix-116
t-xp32-ix-116.build.mozilla.org is an alias for t-xp32-ix-116.wintest.releng.scl3.mozilla.com.
t-xp32-ix-116.wintest.releng.scl3.mozilla.com has address 10.26.41.136
(In reply to Armen Zambrano [:armenzg] (Release Engineering) (EDT/UTC-4) from comment #4)
> It seems that all versions of Windows share the same hg version. [1]
> Same VLANs. [2]
> 
> Could it be some sort of weird interaction between hgtool.py and hg?

Probably not. Some of these happen without hgtool.py (from https://bugzilla.mozilla.org/show_bug.cgi?id=900825#c3):
> 'c:\\mozilla-build\\hg\\hg' 'clone' 'http://hg.mozilla.org/build/mozharness' 'scripts'
> program finished with exit code -1073741800

> It would be great to have logs to inspect.

Full example here: https://tbpl.mozilla.org/php/getParsedLog.php?id=30418653&tree=Fx-Team
Something interesting here is that mercurial dies very quickly - it doesn't even print "requesting all changes". The Buildbot output says the elapsed time is just 100ms.
I wonder why we don't call hgtool.py for mozharness:
'c:\\mozilla-build\\hg\\hg' 'clone' 'http://hg.mozilla.org/build/mozharness' 'scripts'

On another note, we don't reboot after the job fails since count_and_reboot.py is inside the mozharness repo.
Maybe we lose DNS resolution, and since we don't reboot we can't recover.

Maybe we can look at the error logs that Windows keeps (Event Viewer).
Host t-w864-ix-033 had its first incident on Sat Nov 9 10:27:28 2013.

Perhaps we should change ScriptFactory to call the reboot command directly, without count_and_reboot.py.
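
For illustration, a minimal sketch of what a direct reboot step could look like if ScriptFactory bypassed count_and_reboot.py (the function name and the choice of native shutdown commands are assumptions, not existing factory code):

    # sketch only, not part of ScriptFactory
    import subprocess
    import sys

    def reboot_now():
        """Invoke the platform's native reboot command directly."""
        if sys.platform.startswith('win'):
            # /r = reboot, /t 0 = no delay, /f = force running apps to close
            subprocess.check_call(['shutdown', '/r', '/t', '0', '/f'])
        else:
            subprocess.check_call(['sudo', 'reboot'])

This sidesteps the chicken-and-egg problem of needing a successful clone before the reboot helper exists, at the cost of duplicating reboot logic per platform.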
(In reply to Armen Zambrano [:armenzg] (Release Engineering) (EDT/UTC-4) from comment #7)
> I wonder why we don't call hgtool.py for mozharness:
> 'c:\\mozilla-build\\hg\\hg' 'clone' 'http://hg.mozilla.org/build/mozharness'
> 'scripts'

We don't have hgtool until we have mozharness.


> On another note, we don't reboot after the job fails since count_and_reboot.py
> is inside the mozharness repo.
> Maybe we lose DNS resolution, and since we don't reboot we can't recover.

Interesting thought, this seems rather plausible! It fits in with the fact that hg dies so quickly. Is there another bug for this intermittent loss in DNS?

> Maybe we can look at the Error logs that Windows has.
> Host t-w864-ix-033 had its first incident on Sat Nov 9 10:27:28 2013.

Indeed. We have this at 9:24am on that day:
The system failed to register host (A or AAAA) resource records (RRs) for network adapter
with settings:

           Adapter Name : {35E0B54E-FAD3-461C-8363-19AFE39D4762}
           Host Name : T-W864-IX-033
           Primary Domain Suffix : releng.ad.mozilla.com
           DNS server list :
             	10.26.75.40, 10.26.75.41
           Sent update to server : <?>
           IP Address(es) :
             10.26.40.63

The reason the system could not register these RRs during the update request was because of a system problem. You can manually retry DNS registration of the network adapter and its settings by typing 'ipconfig /registerdns' at the command prompt. If problems still persist, contact your DNS server or network systems administrator. See event details for specific error code information.

> Perhaps we should change ScriptFactory to call the reboot command directly,
> without count_and_reboot.py.

Maybe... this is tricky to do cross-platform, though. We should file a separate bug for it.
Given that this is confirmed to be a DNS issue and _not_ an issue with the slaves themselves, I'm going to put all of the "affected" slaves back in production.

The current workaround is a manual reboot, but we need to find some way to stop these slaves from getting into infinite RETRY loops.
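
As a strawman for the "add auto retry logic" part of the whiteboard, here is a bounded retry around the clone that gives up (and lets a reboot step run) once its attempts are exhausted; all names below are hypothetical, this is not existing buildbot or mozharness code:

    # sketch only: cap hg clone retries instead of retrying forever
    import subprocess
    import time

    MAX_ATTEMPTS = 3
    HG = r'c:\mozilla-build\hg\hg'

    def clone_with_retries(repo_url, dest):
        for attempt in range(1, MAX_ATTEMPTS + 1):
            if subprocess.call([HG, 'clone', repo_url, dest]) == 0:
                return True
            time.sleep(30 * attempt)  # back off before the next try
        # out of attempts: let the caller reboot the slave instead of setting RETRY again
        return False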
Severity: normal → major
Q, any thoughts on comment #8 or 9?
Flags: needinfo?(q)
Severity: major → normal
Summary: many w864 slaves failing to clone repositories with weird exit codes → w864 slaves lose dns settings sometimes; get into infinite retry loops
This might not be DNS-related after all...
12:49 < dustin> so this was an onsite DNS "blip"?
12:49 < bhearsum|buildduty> i'm not entirely sure
12:49 < bhearsum|buildduty> https://bugzilla.mozilla.org/show_bug.cgi?id=937279#c8 has an event from the Event Viewer that has details
12:50 < dustin> ah, that's about using dynamic DNS to register its forward/reverse with a DNS server
12:50 < dustin> we don't support dynamic DNS
12:50 < dustin> which is why it failed
12:51 < dustin> but those forward/reverse records are (should be!) already in DNS
12:51 < dustin> I thought we had disabled that, but my memory's fuzzy lo those many months ago
12:51 < bhearsum|buildduty> so, this could just be coincidence...
12:51 < dustin> yeah
12:51 < bhearsum|buildduty> we're not 100% certain dns failure is causing these hg issues
12:51 < dustin> I'm fairly certain all windows hosts will show that error message from time to time
The event viewer message is about failing to modify the host's forward/reverse DNS using Dynamic DNS.  Our DNS servers don't support Dynamic DNS, so this isn't surprising, and I suspect that all of our Windows hosts will have this error in their event logs.  I have a vague memory of suggesting that we disable these registrations, but I don't know if that ever happened.

In other words, nothing I see here implicates external DNS failures.

I think the more interesting clue will be in the failures.  From the dep bugs, I see

--
Has a case of the "any use of hg results in 'program finished with exit code -1073741800' and setting RETRY" blues.
--

where that code is coming from HG.EXE.  So figuring out what that means and what might cause it would be a good start.  A few bigger log snippets might help, too, especially with timestamps that we could correlate with other events, e.g., DNS server logs.
(In reply to Dustin J. Mitchell [:dustin] (I read my bugmail; don't needinfo me) from comment #12)
> --
> Has a case of the "any use of hg results in 'program finished with exit code
> -1073741800' and setting RETRY" blues.
> --
> 
> where that code is coming from HG.EXE.  So figuring out what that means and
> what might cause it would be a good start.  A few bigger log snippets might
> help, too, especially with timestamps that we could correlate with other
> events, e.g., DNS server logs.

Unfortunately, there's no output from hg when we hit this. It runs for about 100ms and then dies. https://bugzilla.mozilla.org/show_bug.cgi?id=937154#c0 has the most complete log snippet we have.
Thanks!  -1073741800 expressed in 32-bit 2's complement is 0xc0000018.
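
A quick way to re-derive that, for anyone checking the arithmetic, is a one-liner in a Python shell:

    >>> hex(-1073741800 & 0xFFFFFFFF)  # reinterpret the signed 32-bit exit code as unsigned
    '0xc0000018'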

Per http://www.tenox.net/links/ntstatus.html:

C0000018	STATUS_CONFLICTING_ADDRESSES

that leads to
  http://stackoverflow.com/questions/2622607/windows-loader-problem-turn-on-verbose-mode
  http://ntsecurity.nu/onmymind/2006/2006-06-01.html

suggesting that this occurs when the same region of memory is mapped for two purposes.  This is sounding suspiciously like the Cygwin flaw, where the OS doesn't have enough flexibility to put objects like DLLs at arbitrary locations in the process's memory map, and two objects overlap.

But I should probably leave this to the experts...
Dustin has already pointed out that the Windows errors are a red herring. That said, I already have a GPO ready to disable the DNS client from sending DDNS updates to the servers. It shouldn't affect anything other than removing some noise from the Windows logs.
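
If we ever wanted to apply the same change per machine rather than via GPO, my understanding is that it maps to the DisableDynamicUpdate value under the Tcpip parameters key; a minimal sketch, assuming that registry value (it is not confirmed anywhere in this bug):

    # sketch: disable dynamic DNS registration for all adapters on a Windows host
    # (assumes the DisableDynamicUpdate value; must run as an administrator)
    import winreg

    KEY = r'SYSTEM\CurrentControlSet\Services\Tcpip\Parameters'
    with winreg.CreateKeyEx(winreg.HKEY_LOCAL_MACHINE, KEY, 0, winreg.KEY_SET_VALUE) as key:
        winreg.SetValueEx(key, 'DisableDynamicUpdate', 0, winreg.REG_DWORD, 1)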
Flags: needinfo?(q)
Depends on: t-w864-ix-127
Depends on: t-w864-ix-028
Any reason why this is happening mainly on our Windows 8 machines? Is it because it is a 64-bit arch rather than 32-bit?
Whiteboard: SOLUTIONS: modify reboot steps on ScriptFactory AND add auto retry logic
Summary: w864 slaves lose dns settings sometimes; get into infinite retry loops → w864 slaves fail cloning and rebooting; get into infinite retry loops
(In reply to Armen Zambrano [:armenzg] (Release Engineering) (EDT/UTC-4) from comment #16)
> Any reason why this is happening mainly on our Windows 8 machines? Is it
> because it is a 64-bit arch rather than 32-bit?

--^

Regardless of whether we know the answer, I believe we should modify the reboot step to at least manage to recover if count_and_reboot.py is missing (rather than retrying the hg step, which can get stuck in an infinite RETRY loop).
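
For concreteness, one shape that fallback could take; the helper path and function name here are guesses, not the real factory code:

    # sketch: use count_and_reboot.py when the mozharness checkout exists,
    # otherwise reboot natively instead of retrying the clone forever
    import os
    import subprocess

    def reboot_step(scripts_dir):
        helper = os.path.join(scripts_dir, 'external_tools', 'count_and_reboot.py')
        if os.path.exists(helper):
            # existing helper; its usual arguments are elided here
            subprocess.check_call(['python', helper])
        else:
            subprocess.check_call(['shutdown', '/r', '/t', '0'])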
Flags: needinfo?(q)
Flags: needinfo?(dustin)
I'm not sure why you flagged me for needinfo here - was there a question?
Flags: needinfo?(dustin)
(In reply to Dustin J. Mitchell [:dustin] (I read my bugmail; don't needinfo me) from comment #18)
> I'm not sure why you flagged me for needinfo here - was there a question?

comment 16.
Ah, no idea here.
NVM. We'd better fix the reboot step or wait for bug 712205.
Flags: needinfo?(q)
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → INCOMPLETE
Component: Platform Support → Buildduty
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard