Closed Bug 937279 Opened 11 years ago Closed 7 years ago

w864 slaves fail cloning and rebooting; get into infinite retry loops

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

Platform: x86_64
OS: Windows 8
Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED INCOMPLETE

People

(Reporter: bhearsum, Unassigned)

References

Details

(Whiteboard: SOLUTIONS: modify reboot steps on ScriptFactory AND add auto retry logic)

OS: Linux → Windows 8
It seems that all versions of Windows share the same hg version. [1]
Same VLANs. [2]

Could it be some sort of weird interaction between hgtool.py and hg?
It would be great to have logs to inspect.

[1]
cltbld@T-W732-IX-001 ~
$ /c/mozilla-build/hg/hg.exe --version
Mercurial Distributed SCM (version 2.5.4)

cltbld@T-W864-IX-001 ~
$ /c/mozilla-build/hg/hg.exe --version
Mercurial Distributed SCM (version 2.5.4)

$ /c/mozilla-build/hg/hg.exe --version
Mercurial Distributed SCM (version 2.5.4)

[2]
armenzg@UbuntVM:~/moz/patches$ host t-w732-ix-001
t-w732-ix-001.build.mozilla.org is an alias for t-w732-ix-001.wintest.releng.scl3.mozilla.com.
t-w732-ix-001.wintest.releng.scl3.mozilla.com has address 10.26.40.151
armenzg@UbuntVM:~/moz/patches$ host t-w864-ix-001
t-w864-ix-001.build.mozilla.org is an alias for t-w864-ix-001.wintest.releng.scl3.mozilla.com.
t-w864-ix-001.wintest.releng.scl3.mozilla.com has address 10.26.40.31
armenzg@UbuntVM:~/moz/patches$ host t-xp32-ix-001
t-xp32-ix-001.build.mozilla.org is an alias for t-xp32-ix-001.wintest.releng.scl3.mozilla.com.
t-xp32-ix-001.wintest.releng.scl3.mozilla.com has address 10.26.41.21
armenzg@UbuntVM:~/moz/patches$ host t-xp32-ix-109
t-xp32-ix-109.build.mozilla.org is an alias for t-xp32-ix-109.wintest.releng.scl3.mozilla.com.
t-xp32-ix-109.wintest.releng.scl3.mozilla.com has address 10.26.41.129
armenzg@UbuntVM:~/moz/patches$ host t-xp32-ix-075
t-xp32-ix-075.build.mozilla.org is an alias for t-xp32-ix-075.wintest.releng.scl3.mozilla.com.
t-xp32-ix-075.wintest.releng.scl3.mozilla.com has address 10.26.41.95
armenzg@UbuntVM:~/moz/patches$ host t-xp32-ix-116
t-xp32-ix-116.build.mozilla.org is an alias for t-xp32-ix-116.wintest.releng.scl3.mozilla.com.
t-xp32-ix-116.wintest.releng.scl3.mozilla.com has address 10.26.41.136
(In reply to Armen Zambrano [:armenzg] (Release Engineering) (EDT/UTC-4) from comment #4)
> It seems that all versions of Windows share the same hg version. [1]
> Same VLANs. [2]
> 
> Could it be some sort of weird interaction between hgtool.py and hg?

Probably not. Some of these happen without hgtool.py (from https://bugzilla.mozilla.org/show_bug.cgi?id=900825#c3):
> 'c:\\mozilla-build\\hg\\hg' 'clone' 'http://hg.mozilla.org/build/mozharness' 'scripts'
> program finished with exit code -1073741800

> It would be great to have logs to inspect.

Full example here: https://tbpl.mozilla.org/php/getParsedLog.php?id=30418653&tree=Fx-Team
Something interesting here is that mercurial dies very quickly - it doesn't even print "requesting all changes". The Buildbot output says the elapsed time is just 100ms.
I wonder why we don't call hgtool.py for mozharness:
'c:\\mozilla-build\\hg\\hg' 'clone' 'http://hg.mozilla.org/build/mozharness' 'scripts'

On another note, we don't reboot after the job fails since count_and_reboot.py is inside the mozharness repo.
Maybe we lose DNS resolution, and since we don't reboot we can't recover.

Maybe we can look at the error logs that Windows keeps (Event Viewer).
Host t-w864-ix-033 had its first incident on Sat Nov 9 10:27:28 2013.

Perhaps we should change ScriptFactory to call the reboot command directly, without count_and_reboot.py.
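
For illustration, a minimal sketch of what a direct reboot step could look like if ScriptFactory bypassed count_and_reboot.py (the function name and the choice of native shutdown commands are assumptions, not existing factory code):

    # sketch only, not part of ScriptFactory
    import subprocess
    import sys

    def reboot_now():
        """Invoke the platform's native reboot command directly."""
        if sys.platform.startswith('win'):
            # /r = reboot, /t 0 = no delay, /f = force running apps to close
            subprocess.check_call(['shutdown', '/r', '/t', '0', '/f'])
        else:
            subprocess.check_call(['sudo', 'reboot'])

This sidesteps the chicken-and-egg problem of needing a successful clone before the reboot helper exists, at the cost of duplicating reboot logic per platform.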
(In reply to Armen Zambrano [:armenzg] (Release Engineering) (EDT/UTC-4) from comment #7)
> I wonder why we don't call hgtool.py for mozharness:
> 'c:\\mozilla-build\\hg\\hg' 'clone' 'http://hg.mozilla.org/build/mozharness'
> 'scripts'

We don't have hgtool until we have mozharness.


> On another note, we don't reboot after the job fails since count_and_reboot.py
> is inside the mozharness repo.
> Maybe we lose DNS resolution, and since we don't reboot we can't recover.

Interesting thought, this seems rather plausible! It fits in with the fact that hg dies so quickly. Is there another bug for this intermittent loss in DNS?

> Maybe we can look at the Error logs that Windows has.
> Host t-w864-ix-033 had its first incident on Sat Nov 9 10:27:28 2013.

Indeed. We have this at 9:24am on that day:
The system failed to register host (A or AAAA) resource records (RRs) for network adapter
with settings:

           Adapter Name : {35E0B54E-FAD3-461C-8363-19AFE39D4762}
           Host Name : T-W864-IX-033
           Primary Domain Suffix : releng.ad.mozilla.com
           DNS server list :
             	10.26.75.40, 10.26.75.41
           Sent update to server : <?>
           IP Address(es) :
             10.26.40.63

The reason the system could not register these RRs during the update request was because of a system problem. You can manually retry DNS registration of the network adapter and its settings by typing 'ipconfig /registerdns' at the command prompt. If problems still persist, contact your DNS server or network systems administrator. See event details for specific error code information.

> Perhaps we should change ScriptFactory to call the reboot command directly,
> without count_and_reboot.py.

Maybe... this is tricky to do cross-platform, though. We should file a separate bug for it.
Given that this is confirmed to be a DNS issue and _not_ an issue with the slaves themselves, I'm going to put all of the "affected" slaves back in production.

The current workaround is a manual reboot, but we need to find some way to stop these slaves from getting into infinite RETRY loops.
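
As a strawman for the "add auto retry logic" part of the whiteboard, here is a bounded retry around the clone that gives up (and lets a reboot step run) once its attempts are exhausted; all names below are hypothetical, this is not existing buildbot or mozharness code:

    # sketch only: cap hg clone retries instead of retrying forever
    import subprocess
    import time

    MAX_ATTEMPTS = 3
    HG = r'c:\mozilla-build\hg\hg'

    def clone_with_retries(repo_url, dest):
        for attempt in range(1, MAX_ATTEMPTS + 1):
            if subprocess.call([HG, 'clone', repo_url, dest]) == 0:
                return True
            time.sleep(30 * attempt)  # back off before the next try
        # out of attempts: let the caller reboot the slave instead of setting RETRY again
        return False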
Severity: normal → major
Q, any thoughts on comment #8 or 9?
Flags: needinfo?(q)
Severity: major → normal
Summary: many w864 slaves failing to clone repositories with weird exit codes → w864 slaves lose dns settings sometimes; get into infinite retry loops
This might not be DNS-related after all...
12:49 < dustin> so this was an onsite DNS "blip"?
12:49 < bhearsum|buildduty> i'm not entirely sure
12:49 < bhearsum|buildduty> https://bugzilla.mozilla.org/show_bug.cgi?id=937279#c8 has an event from the Event Viewer that has details
12:50 < dustin> ah, that's about using dynamic DNS to register its forward/reverse with a DNS server
12:50 < dustin> we don't support dynamic DNS
12:50 < dustin> which is why it failed
12:51 < dustin> but those forward/reverse records are (should be!) already in DNS
12:51 < dustin> I thought we had disabled that, but my memory's fuzzy lo those many months ago
12:51 < bhearsum|buildduty> so, this could just be coincidence...
12:51 < dustin> yeah
12:51 < bhearsum|buildduty> we're not 100% certain dns failure is causing these hg issues
12:51 < dustin> I'm fairly certain all windows hosts will show that error message from time to time
The event viewer message is about failing to modify the host's forward/reverse DNS using Dynamic DNS.  Our DNS servers don't support Dynamic DNS, so this isn't surprising, and I suspect that all of our Windows hosts will have this error in their event logs.  I have a vague memory of suggesting that we disable these registrations, but I don't know if that ever happened.

In other words, nothing I see here implicates external DNS failures.

I think the more interesting clue will be in the failures.  From the dep bugs, I see

--
Has a case of the "any use of hg results in 'program finished with exit code -1073741800' and setting RETRY" blues.
--

where that code is coming from HG.EXE.  So figuring out what that means and what might cause it would be a good start.  A few bigger log snippets might help, too, especially with timestamps that we could correlate with other events, e.g., DNS server logs.
(In reply to Dustin J. Mitchell [:dustin] (I read my bugmail; don't needinfo me) from comment #12)
> --
> Has a case of the "any use of hg results in 'program finished with exit code
> -1073741800' and setting RETRY" blues.
> --
> 
> where that code is coming from HG.EXE.  So figuring out what that means and
> what might cause it would be a good start.  A few bigger log snippets might
> help, too, especially with timestamps that we could correlate with other
> events, e.g., DNS server logs.

Unfortunately, there's no output from hg when we hit this. It runs for about 100ms and then dies. https://bugzilla.mozilla.org/show_bug.cgi?id=937154#c0 has the most complete log snippet we have.
Thanks!  -1073741800 expressed in 32-bit 2's complement is 0xc0000018.
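
A quick way to re-derive that, for anyone checking the arithmetic, is a one-liner in a Python shell:

    >>> hex(-1073741800 & 0xFFFFFFFF)  # reinterpret the signed 32-bit exit code as unsigned
    '0xc0000018'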

Per http://www.tenox.net/links/ntstatus.html:

C0000018	STATUS_CONFLICTING_ADDRESSES

that leads to
  http://stackoverflow.com/questions/2622607/windows-loader-problem-turn-on-verbose-mode
  http://ntsecurity.nu/onmymind/2006/2006-06-01.html

suggesting that this occurs when the same region of memory is mapped for two purposes.  This is sounding suspiciously like the Cygwin flaw, where the OS doesn't have enough flexibility to put objects like DLLs at arbitrary locations in the process's memory map, and two objects overlap.

But I should probably leave this to the experts...
Dustin has already pointed out that the Windows errors are a red herring. That said, I already have a GPO ready to disable the DNS client from sending DDNS updates to the servers. It shouldn't affect anything other than removing some noise from the Windows logs.
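
If we ever wanted to apply the same change per machine rather than via GPO, my understanding is that it maps to the DisableDynamicUpdate value under the Tcpip parameters key; a minimal sketch, assuming that registry value (it is not confirmed anywhere in this bug):

    # sketch: disable dynamic DNS registration for all adapters on a Windows host
    # (assumes the DisableDynamicUpdate value; must run as an administrator)
    import winreg

    KEY = r'SYSTEM\CurrentControlSet\Services\Tcpip\Parameters'
    with winreg.CreateKeyEx(winreg.HKEY_LOCAL_MACHINE, KEY, 0, winreg.KEY_SET_VALUE) as key:
        winreg.SetValueEx(key, 'DisableDynamicUpdate', 0, winreg.REG_DWORD, 1)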
Flags: needinfo?(q)
Depends on: t-w864-ix-127
Depends on: t-w864-ix-028
Any reason why this is happening mainly on our Windows 8 machines? Is it because it is a 64-bit arch rather than 32-bit?
Whiteboard: SOLUTIONS: modify reboot steps on ScriptFactory AND add auto retry logic
Summary: w864 slaves lose dns settings sometimes; get into infinite retry loops → w864 slaves fail cloning and rebooting; get into infinite retry loops
(In reply to Armen Zambrano [:armenzg] (Release Engineering) (EDT/UTC-4) from comment #16)
> Any reason why this is happening mainly on our Windows 8 machines? Is it
> because it is a 64-bit arch rather than 32-bit?

--^

Regardless of whether we know the answer, I believe we should modify the reboot step to at least manage to recover if count_and_reboot.py is missing (rather than retrying the hg step, which can get stuck in an infinite RETRY loop).
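
For concreteness, one shape that fallback could take; the helper path and function name here are guesses, not the real factory code:

    # sketch: use count_and_reboot.py when the mozharness checkout exists,
    # otherwise reboot natively instead of retrying the clone forever
    import os
    import subprocess

    def reboot_step(scripts_dir):
        helper = os.path.join(scripts_dir, 'external_tools', 'count_and_reboot.py')
        if os.path.exists(helper):
            # existing helper; its usual arguments are elided here
            subprocess.check_call(['python', helper])
        else:
            subprocess.check_call(['shutdown', '/r', '/t', '0'])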
Flags: needinfo?(q)
Flags: needinfo?(dustin)
I'm not sure why you flagged me for needinfo here - was there a question?
Flags: needinfo?(dustin)
(In reply to Dustin J. Mitchell [:dustin] (I read my bugmail; don't needinfo me) from comment #18)
> I'm not sure why you flagged me for needinfo here - was there a question?

comment 16.
Ah, no idea here.
NVM. We'd better fix the reboot step or wait for bug 712205.
Flags: needinfo?(q)
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → INCOMPLETE
Component: Platform Support → Buildduty
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard