Closed Bug 1464070 (t-linux64-ms-280) Opened 6 years ago Closed 6 years ago

[MDC1] t-linux64-ms-280 problem tracking

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: zfay, Assigned: dragrom)

References

Details

Attachments

(3 files)

The machine is not showing up in taskcluster and freezes up in PXE boot.
Depends on: 1464064
t-linux64-ms-279 does the same thing after a cold boot.
On 280, the pxeboot menu never displays. Instead, the machine tries to connect over IPv4 and then fails over to IPv6. Only this text is displayed:
```
>> Booting Embedded LOM 1 Port 1 : Mellanox Network Adapter - NIC (PXE IPv4) 

>> Booting PXE over IPv4.
  Station IP address is 10.49.58.100
```
Other details: 280 was a loaner and has not been reclaimed yet (bug 1410207). So it was working correctly but was not running the taskcluster worker. However, we still need to reimage it to reclaim it from being a loaner (the loaner bug is already closed). So we need to fix the pxeboot.

279 is not running the taskcluster worker. It is not disabled in the puppet node definitions, but it appears to need reimaging or re-puppetizing because it does not have any taskcluster worker files in place (it may not be getting the correct node definition; there is no /etc/taskcluster*yaml, /usr/local/bin/run-tc-worker.sh, etc.).
However, we cannot reimage 279 because it sees the same pxeboot failure as 280.
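The missing-file check described above could be scripted roughly as follows. This is only a sketch: the two paths are the ones quoted in the comment, while the function name `missing_worker_files` and the `root` parameter are hypothetical and exist just so the check can be pointed at a test directory.

```python
import glob
import os

# Paths quoted in the comment above; the glob pattern is kept as written there.
WORKER_FILE_PATTERNS = [
    "/etc/taskcluster*yaml",
    "/usr/local/bin/run-tc-worker.sh",
]


def missing_worker_files(patterns=WORKER_FILE_PATTERNS, root="/"):
    """Return the patterns that match no file under root (hypothetical helper)."""
    missing = []
    for pattern in patterns:
        # Re-root the absolute pattern so the check also works on a test tree.
        rooted = os.path.join(root, pattern.lstrip("/"))
        if not glob.glob(rooted):
            missing.append(pattern)
    return missing
```

Calling `missing_worker_files()` on the worker itself would list both patterns if, as reported for 279, none of the worker files are in place.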
Blocks: 1464080
(In reply to Dave House [:dhouse] from comment #2)
> On 280, the pxeboot menu never displays. Instead it is trying to connect and
> fails over to ipv6. On this text is displayed:
> ```
> >> Booting Embedded LOM 1 Port 1 : Mellanox Network Adapter - NIC (PXE IPv4) 
> 
> >> Booting PXE over IPv4.
>   Station IP address is 10.49.58.100
> ```
> other details: 280 was a loaner and was not reclaimed yet (bug 1410207). So
> it was working correctly but was not running taskcluster worker. However, we
> still need to reimage it to reclaim it from being a loaner (bug is already
> closed). So we need to fix the pxeboot.
> 
> 279 is not running taskcluster worker. It is not disabled in the puppet node
> definitions, but it appears to need reimaging or to be re-puppetized because
> it does not get have any taskcluster worker files in place (may not be
> getting the correct node definition. there is no /etc/taskcluster*yaml or
> /usr/local/bin/run-tc-worker.sh etc).
> However we cannot reimage 279 because sees the same pxeboot failure as 280.

Looking into nodes.pp:
# Loaner for dividehex
node 't-linux64-ms-279.test.releng.mdc1.mozilla.com' {
    $aspects = [ 'low-security' ]
    include toplevel::server
}
(In reply to Dragos Crisan [:dragrom] from comment #4)
> (In reply to Dave House [:dhouse] from comment #2)
> > On 280, the pxeboot menu never displays. Instead it is trying to connect and
> > fails over to ipv6. On this text is displayed:
> > ```
> > >> Booting Embedded LOM 1 Port 1 : Mellanox Network Adapter - NIC (PXE IPv4) 
> > 
> > >> Booting PXE over IPv4.
> >   Station IP address is 10.49.58.100
> > ```
> > other details: 280 was a loaner and was not reclaimed yet (bug 1410207). So
> > it was working correctly but was not running taskcluster worker. However, we
> > still need to reimage it to reclaim it from being a loaner (bug is already
> > closed). So we need to fix the pxeboot.
> > 
> > 279 is not running taskcluster worker. It is not disabled in the puppet node
> > definitions, but it appears to need reimaging or to be re-puppetized because
> > it does not get have any taskcluster worker files in place (may not be
> > getting the correct node definition. there is no /etc/taskcluster*yaml or
> > /usr/local/bin/run-tc-worker.sh etc).
> > However we cannot reimage 279 because sees the same pxeboot failure as 280.
> 
> Looking into nodes.pp:
> # Loaner for dividehex
> node 't-linux64-ms-279.test.releng.mdc1.mozilla.com' {
>     $aspects = [ 'low-security' ]
>     include toplevel::server
> }

:) Thank you!
Mark and Jake, I'm cc'ing you on this bug where we started tracking the pxeboot failing on some of the linux moonshots:

So far, these fail to network boot (timeout):
t-linux64-ms-{193,279,280}

There are likely others that also fail, but we have not tried netbooting all of the linux nodes recently.
Node 281 (Windows/Win vlan) is also hitting an issue on PXE boot. It attempts to boot over IPv4, fails, and then continuously tries over IPv6. I find the proximity of 279, 280, and 281 curious.
(In reply to Mark Cornmesser [:markco] from comment #7)
> Node 281 (Windows/Win vlan) is also hit an issue on pxe boot. It attempts to
> over IPv4, fails, and then continuously try over IPv6. I find the proximity
> of 279, 280, and 281 curious.

+1. I'll try a network boot on the other linux nodes on moon-chassis-7 (the linux queue is empty, so I'm not concerned about it backing up if I pull a few workers out):

t-linux64-ms-{271..280}.test.releng.mdc1.mozilla.com
c1n1..c10n1
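For reference, the brace range and the cartridge labels above can be expanded with a short sketch like this one. The one-to-one, in-order pairing of c1n1..c10n1 with 271..280 is an assumption read off the two ranges as listed, and `chassis_hosts` is a hypothetical helper name.

```python
DOMAIN = "test.releng.mdc1.mozilla.com"


def chassis_hosts(first=271, last=280):
    """Map cartridge labels (c1n1, c2n1, ...) to worker FQDNs.

    Assumes the cartridges and host numbers pair up in order, which the
    comment above implies but does not state outright.
    """
    hosts = {}
    for slot, number in enumerate(range(first, last + 1), start=1):
        hosts[f"c{slot}n1"] = f"t-linux64-ms-{number}.{DOMAIN}"
    return hosts
```

Under that assumption, `chassis_hosts()["c10n1"]` is t-linux64-ms-280.test.releng.mdc1.mozilla.com.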
(In reply to Dave House [:dhouse] from comment #8)
> (In reply to Mark Cornmesser [:markco] from comment #7)
> > Node 281 (Windows/Win vlan) is also hit an issue on pxe boot. It attempts to
> > over IPv4, fails, and then continuously try over IPv6. I find the proximity
> > of 279, 280, and 281 curious.
> 
> +1 I'll try a network boot on the other linux nodes on moon-chassis-7 (the
> linux queue is empty. so I'm not concerned about it backing-up if I pull a
> few workers out):
> 
> t-linux64-ms-{271..280}.test.releng.mdc1.mozilla.com
> c1n1..c10n1

t-linux64-ms-271 has the same failure: it times out on IPv4 pxeboot.
```
>> Booting Embedded LOM 1 Port 1 : Mellanox Network Adapter - NIC (PXE IPv4) 

>> Booting PXE over IPv4.
  Station IP address is 10.49.58.92
```

I'm testing the others also {272..278}.

Maybe something like the hp uefi boot setting is not set for this chassis or set of machines.
```
>> Booting Embedded LOM 1 Port 1 : Mellanox Network Adapter - NIC (PXE IPv4) 

>> Booting PXE over IPv4.
  Station IP address is 10.49.58.92

  Server IP address is 10.48.75.31
  NBP filename is /bootx64.efi
  NBP filesize is 0 Bytes
  PXE-E18: Server response timeout.
```
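When comparing many nodes, a console capture like the one above can be reduced to its key fields. The following is a minimal sketch that assumes the capture has exactly the line formats shown in this bug; the field names and `parse_pxe_console` are hypothetical.

```python
import re

# Patterns match the line formats seen in the captures in this bug.
FIELD_PATTERNS = {
    "station_ip": re.compile(r"Station IP address is (\S+)"),
    "server_ip": re.compile(r"Server IP address is (\S+)"),
    "nbp_filename": re.compile(r"NBP filename is (\S+)"),
    "nbp_filesize": re.compile(r"NBP filesize is (\d+) Bytes"),
    "error": re.compile(r"(PXE-E\d+): (.+)"),
}


def parse_pxe_console(text):
    """Extract the fields found in a PXE console capture; absent fields are omitted."""
    result = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        if match is None:
            continue
        if name == "error":
            result["error_code"] = match.group(1)
            result["error_text"] = match.group(2).strip()
        elif name == "nbp_filesize":
            result[name] = int(match.group(1))
        else:
            result[name] = match.group(1)
    return result
```

For the capture above, this yields the server 10.48.75.31, the filename /bootx64.efi, a size of 0, and the error code PXE-E18, which together point at the boot file transfer rather than at the address lease.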
(In reply to Dave House [:dhouse] from comment #9)
> I'm testing the others also {272..278}.

t-linux64-ms-{271..280} all have this problem.
(In reply to Dave House [:dhouse] from comment #10)
> ```
> >> Booting Embedded LOM 1 Port 1 : Mellanox Network Adapter - NIC (PXE IPv4) 
> 
> >> Booting PXE over IPv4.
>   Station IP address is 10.49.58.92
> 
>   Server IP address is 10.48.75.31
>   NBP filename is /bootx64.efi
>   NBP filesize is 0 Bytes
>   PXE-E18: Server response timeout.
> ```

Rob, could you verify that the hp uefi and other dhcp-options are set correctly in mdc1 for the moon-chassis-7 hosts (I found the hp uefi filter in infoblox but I was not able to find where it was turned on)?
[10.49.58.91 - 10.49.58.100] (t-linux64-ms-{271..280}) and
10.49.40.182 (t-w1064-ms-281.wintest.releng.mdc1.mozilla.com)

We are seeing timeouts on pxeboot for these (for both linux and windows).

Above is what I see for the linux machines at pxeboot, and here is what I see for windows on 10.49.40.182:
```
>> Booting Embedded LOM 1 Port 1 : Mellanox Network Adapter - NIC (PXE IPv4) 

>> Booting PXE over IPv4.
  PXE-E18: Server response timeout.

>> Booting Embedded LOM 1 Port 1 : Mellanox Network Adapter - NIC (PXE IPv6) 

>> Booting PXE over IPv6
  PXE-E21: Remote boot cancelled.
```
Flags: needinfo?(rtucker)
Well shoot, I spot-checked nodes on moon-chassis-1 and moon-chassis-6, and I see the same pxe timeout (I verified that I get the same result when trying from both the ilo ssh and java interfaces):
```
>> Booting Embedded LOM 1 Port 1 : Mellanox Network Adapter - NIC (PXE IPv4) 

>> Booting PXE over IPv4.
  Station IP address is 10.49.58.76

  Server IP address is 10.48.75.31
  NBP filename is /bootx64.efi
  NBP filesize is 0 Bytes
  PXE-E18: Server response timeout.
```

I tried a cold boot and a restart, and got the same pxe timeout for both.

I know that we successfully uefi/pxebooted these earlier this week (CIDuty and I have reimaged about 8 linux moonshots in mdc1). So maybe there was a change in the pxe/tftp server or network.
I haven't made any changes.

Is there anything specific you want me to confirm?
Flags: needinfo?(rtucker)
(In reply to Rob Tucker [:rtucker] from comment #14)
> I haven't made any changes.
> 
> Is there anything specific you want me to confirm?

Could you show me where the hp uefi boot filter is set on the releng network in infoblox mdc1, and what other dhcp options are set for test.releng.mdc1? I tried to find them, but I only found the definition for the hp uefi filter and so I think I'm not checking the correct place.
Flags: needinfo?(rtucker)
The filter "HP - UEFI Clients 00007" is set correctly on 10.51.56.0/22 and 10.49.56.0/22 hence getting the filename of /bootx64.efi. You wouldn't get /bootx64.efi without the filter.
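As a quick cross-check of which addresses from this bug fall inside the two networks named here, something like the following could be used. It is a sketch only: the network list contains just the two prefixes mentioned in this comment, and `filtered_network_for` is a hypothetical helper.

```python
import ipaddress

# Only the two networks named in the comment above; others may exist.
FILTERED_NETWORKS = [
    ipaddress.ip_network("10.51.56.0/22"),
    ipaddress.ip_network("10.49.56.0/22"),
]


def filtered_network_for(address):
    """Return the listed network containing address, or None if there is none."""
    ip = ipaddress.ip_address(address)
    for network in FILTERED_NETWORKS:
        if ip in network:
            return network
    return None
```

The linux addresses 10.49.58.91 to 10.49.58.100 all land in 10.49.56.0/22, consistent with those nodes receiving /bootx64.efi. The Windows address 10.49.40.182 is outside both listed networks, so this check says nothing about its DHCP options.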

You can view the options by clicking the settings wheel next to the network in the IPAM browser and looking at the IPv4 DHCP Options

Is 10.48.75.31 the proper tftp server? This is the most likely issue, NOT the filter.
Flags: needinfo?(rtucker)
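One way to test the question above directly is to send a single TFTP read request for the boot file and see whether anything answers. The sketch below builds the request as defined in RFC 1350 and uses only the standard library; the function names, the default port 69, and the 5-second timeout are assumptions, and it has not been run against the servers in this bug.

```python
import socket
import struct


def build_rrq(filename, mode="octet"):
    """Build a TFTP read request: opcode 1, filename, NUL, mode, NUL (RFC 1350)."""
    return (
        struct.pack("!H", 1)
        + filename.encode("ascii") + b"\0"
        + mode.encode("ascii") + b"\0"
    )


def tftp_probe(server, filename="/bootx64.efi", port=69, timeout=5.0):
    """Send one RRQ; return the opcode of the first reply, or None if none arrives.

    Opcode 3 is DATA (the file is being served) and 5 is ERROR. None means
    the request timed out or the socket reported an error.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_rrq(filename), (server, port))
            reply, _ = sock.recvfrom(516)
        except OSError:  # includes socket.timeout
            return None
    if len(reply) < 2:
        return None
    return struct.unpack("!H", reply[:2])[0]
```

A `None` result from `tftp_probe("10.48.75.31")`, run from a host on the same network as the workers, would match the PXE-E18 timeout seen on the consoles.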
(In reply to Rob Tucker [:rtucker] from comment #16)
> The filter "HP - UEFI Clients 00007" is set correctly on 10.51.56.0/22 and
> 10.49.56.0/22 hence getting the filename of /bootx64.efi. You wouldn't get
> /bootx64.efi without the filter.
> 
> You can view the options by clicking the settings wheel next to the network
> in the IPAM browser and looking at the IPv4 DHCP Options
> 
> Is 10.48.75.31 the proper tftp server? This is the most likely issue, NOT
> the filter.

Thank you. Following your directions, I see the dhcp options.

I'll change the next-server to match the change made for the other use of that admin server in bug 1354300.
I've changed the tftp-server for releng.mdc1 and releng.mdc2, in bug https://bugzilla.mozilla.org/show_bug.cgi?id=1464493. I don't see the change yet when I test rebooting t-linux64-ms-280. I'll try it again in the morning (maybe it takes some time to apply).
Just a note, t-linux64-ms-280 is a loaner for :dragrom
(In reply to Attila Craciun [:arny] from comment #19)
> Just a note, t-linux64-ms-280 is a loaner for :dragrom

:arny could you link the loaner bug to this bug?
I tested pxeboot again this morning, on t-linux64-ms-005 as it was not running a task, and it still gets the old tftp server:
```
>> Booting Embedded LOM 1 Port 1 : Mellanox Network Adapter - NIC (PXE IPv4) 

>> Booting PXE over IPv4.
  Station IP address is 10.49.58.5

  Server IP address is 10.48.75.31
  NBP filename is /bootx64.efi
  NBP filesize is 0 Bytes
  PXE-E18: Server response timeout.
```
(In reply to Dave House [:dhouse] from comment #20)
> (In reply to Attila Craciun [:arny] from comment #19)
> > Just a note, t-linux64-ms-280 is a loaner for :dragrom
> 
> :arny could you link the loaner bug to this bug?

It is already set Bug 1410207.
Dave, PXE works on linux-ms-193, but it is CentOS 7.
Worked with :dhouse via IRC and updated the next-server options.
> (In reply to Dave House [:dhouse] from comment #20)
> > (In reply to Attila Craciun [:arny] from comment #19)
> > > Just a note, t-linux64-ms-280 is a loaner for :dragrom
> > 
> > :arny could you link the loaner bug to this bug?
> 
> It is already set Bug 1410207.

Ok, since that is resolved I'll reimage 280 (now that the pxeboot is fixed) to put it back into service.
Do not reimage 280; :dragrom still needs it as a loaner :).
PXE works now; however, the Ubuntu image is broken. Tested on 001 and 193: same message whether or not I add the PUPPET_PASS option.
Dragos needs 280 as a loaner for the next two weeks to test puppet changes for bug 1465309.
(In reply to Attila Craciun [:arny] from comment #28)
> Created attachment 8982132 [details]
> Screenshot from 2018-05-31 09-43-47.png
> 
> PXE works now, however, the Ubuntu image is broken. Tested on 001 and 193,
> same message even if I add the PUPPET_PASS option or not.

The pxe/netboot reimaging process is now fixed. So when we need to reimage this one it will work.
(In reply to Attila Craciun [:arny] from comment #19)
> Just a note, t-linux64-ms-280 is a loaner for :dragrom

Apart from working on that machine and changing syslog to use TCP instead of UDP, can you confirm that you have this machine as a loan?
Flags: needinfo?(dcrisan)
Yes, I'll need this machine as a loaner to test changes for bug 1465309.
Flags: needinfo?(dcrisan)
Should we leave this bug open and have @dragrom let us know when the machine is no longer needed?
:dragrom, are you done with the loaner t-linux64-ms-280?
Alias: t-linux64-ms-280
Flags: needinfo?(dcrisan)
Summary: t-linux64-ms-280 problem tracking → t-linux64-ms-280.test.releng.mdc1.mozilla.com. problem tracking
We can consider this loaner a staging worker, like t-yosemite-r7-380. In my opinion, we can close this bug.
Flags: needinfo?(dcrisan)
Ok, let's keep this bug open as a marker that this machine is not in production (until there is a better way to track the different types).
Blocks: 1464064
No longer depends on: 1464064
Puppet failures were repeated today on this machine. So I stopped this machine (powered-off through ilo).
Assignee: nobody → dcrisan
Flags: needinfo?(dcrisan)
This is the error:

Thu Jul 05 12:20:23 -0700 2018 Puppet (err): Could not delete user cltbld: Execution of '/usr/sbin/userdel cltbld' returned 8: userdel: user cltbld is currently used by process 2882
Thu Jul 05 12:20:23 -0700 2018 /Stage[main]/Main/User[cltbld]/ensure (err): change from present to absent failed: Could not delete user cltbld: Execution of '/usr/sbin/userdel cltbld' returned 8: userdel: user cltbld is currently used by process 2882

This error is caused by the following definition from the puppet master:
node 't-linux64-ms-280.test.releng.mdc1.mozilla.com' {
    $aspects = [ 'low-security' ]
    include toplevel::server
}

After landing the patch from Bug 1473281, this error will disappear.
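Both log lines above report the same underlying failure: userdel returned 8 because a process was still running as cltbld. To pull the blocking user and PID out of such reports, a small sketch like this could be used. It assumes the userdel message format shown above, and `blocking_processes` is a hypothetical helper.

```python
import re

# Matches the userdel message embedded in the puppet error lines above.
USERDEL_BUSY = re.compile(
    r"userdel: user (?P<user>\S+) is currently used by process (?P<pid>\d+)"
)


def blocking_processes(log_text):
    """Return unique (user, pid) pairs reported by userdel, in first-seen order."""
    seen = []
    for match in USERDEL_BUSY.finditer(log_text):
        pair = (match.group("user"), int(match.group("pid")))
        if pair not in seen:
            seen.append(pair)
    return seen
```

For the two lines quoted in this comment, the result is the single pair cltbld and 2882, since both lines describe the same process.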
Flags: needinfo?(dcrisan)
I kicked off the reimage this morning, but I never saw a report from puppet. I may have typo'd the kickstart password. I'm retrying the reimage now.
(In reply to Dave House [:dhouse] from comment #39)
> I kicked off the reimage this morning, but I never saw a report from puppet.
> I may have typo'd the kickstart password. i'm re-trying reimaging it now.

I checked through ilo and found a kernel panic (not able to mount fs) logged to the screen (from the earlier reimage).
Summary: t-linux64-ms-280.test.releng.mdc1.mozilla.com. problem tracking → [MDC1] t-linux64-ms-280 problem tracking
(In reply to Dave House [:dhouse] from comment #40)
> (In reply to Dave House [:dhouse] from comment #39)
> > I kicked off the reimage this morning, but I never saw a report from puppet.
> > I may have typo'd the kickstart password. i'm re-trying reimaging it now.
> 
> I checked through ilo and found a kernal panic (not able to mount fs) logged
> to the screen (from the earlier reimage).

I watched the retried reimage, and it hit a timeout on the initrd.gz download for initial setup. When I tried again to capture a log through ilo+ssh, it did not time out, and so it entered the ubuntu setup correctly.
(In reply to Dave House [:dhouse] from comment #41)
> (In reply to Dave House [:dhouse] from comment #40)
> > (In reply to Dave House [:dhouse] from comment #39)
> > > I kicked off the reimage this morning, but I never saw a report from puppet.
> > > I may have typo'd the kickstart password. i'm re-trying reimaging it now.
> > 
> > I checked through ilo and found a kernal panic (not able to mount fs) logged
> > to the screen (from the earlier reimage).
> 
> I watched on retrying the reimage and it hit a timeout on the initrd.gz
> download for initial setup. On trying again to capture a log through
> ilo+ssh, it did not timeout and so it entered the ubuntu setup correctly.

https://papertrailapp.com/systems/1645518191/events
280 came up correctly from reinstall+puppetize.
There were repeated puppet failures on this machine today. I've powered it off.
Dragos, please power it back on if you need to test on it tomorrow.
Flags: needinfo?(dcrisan)
Raising this bug in priority as it's a known machine with problems.
When the machine is fixed, feel free to remove the P1 from the bug.
Priority: -- → P1
The errors were generated by my tests to install generic-worker on Linux. Puppet is now fixed in my environment. Please restart the machine.
Flags: needinfo?(dcrisan)
I've powered it back on.
Priority: P1 → --
I'll close the bug since this machine is now part of the staging pool.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard