1464070 - (t-linux64-ms-280) [MDC1] t-linux64-ms-280 problem tracking

Reporter

Description

•

6 years ago

The machine is not showing up in taskcluster and freezes during PXE boot.

Zsolt Fay [:zfay]

Reporter

Updated

•

6 years ago

Depends on: 1464064

Zsolt Fay [:zfay]

Reporter

Comment 1

•

6 years ago

t-linux64-ms-279 does the same thing after a cold boot.

:dhouse

Comment 2

•

6 years ago

On 280, the pxeboot menu never displays. Instead, it tries to connect over IPv4 and then fails over to IPv6. Only this text is displayed:
```
>> Booting Embedded LOM 1 Port 1 : Mellanox Network Adapter - NIC (PXE IPv4) 

>> Booting PXE over IPv4.
  Station IP address is 10.49.58.100
```
Other details: 280 was a loaner and has not been reclaimed yet (bug 1410207). It was working correctly but was not running the taskcluster worker. However, we still need to reimage it to reclaim it from being a loaner (that bug is already closed), so we need to fix the pxeboot.

279 is not running the taskcluster worker. It is not disabled in the puppet node definitions, but it appears to need reimaging or to be re-puppetized because it does not have any taskcluster worker files in place. It may not be getting the correct node definition: there is no /etc/taskcluster*yaml, /usr/local/bin/run-tc-worker.sh, etc.
However, we cannot reimage 279 because it sees the same pxeboot failure as 280.

:dhouse

Updated

•

6 years ago

Blocks: 1464080

Dragos Crisan [:dragrom]

Assignee

Comment 4

•

6 years ago

(In reply to Dave House [:dhouse] from comment #2)
> On 280, the pxeboot menu never displays. Instead, it tries to connect over
> IPv4 and then fails over to IPv6. Only this text is displayed:
> ```
> >> Booting Embedded LOM 1 Port 1 : Mellanox Network Adapter - NIC (PXE IPv4) 
> 
> >> Booting PXE over IPv4.
>   Station IP address is 10.49.58.100
> ```
> Other details: 280 was a loaner and has not been reclaimed yet (bug 1410207).
> It was working correctly but was not running the taskcluster worker. However,
> we still need to reimage it to reclaim it from being a loaner (that bug is
> already closed), so we need to fix the pxeboot.
> 
> 279 is not running the taskcluster worker. It is not disabled in the puppet
> node definitions, but it appears to need reimaging or to be re-puppetized
> because it does not have any taskcluster worker files in place. It may not be
> getting the correct node definition: there is no /etc/taskcluster*yaml,
> /usr/local/bin/run-tc-worker.sh, etc.
> However, we cannot reimage 279 because it sees the same pxeboot failure as 280.

Looking into nodes.pp:
# Loaner for dividehex
node 't-linux64-ms-279.test.releng.mdc1.mozilla.com' {
    $aspects = [ 'low-security' ]
    include toplevel::server
}

:dhouse

Comment 5

•

6 years ago

(In reply to Dragos Crisan [:dragrom] from comment #4)

:) Thank you!

:dhouse

Comment 6

•

6 years ago

Mark and Jake, I'm cc'ing you on this bug where we started tracking the pxeboot failing on some of the linux moonshots:

So far, these fail to network boot (timeout):
t-linux64-ms-{193,279,280}

There are likely others that also fail, but we have not tried netbooting all of the linux nodes recently.

Mark Cornmesser [:markco] OOO 2024/04/15

Comment 7

•

6 years ago

Node 281 (Windows/Win vlan) also hit an issue on pxe boot. It attempts to boot over IPv4, fails, and then continuously retries over IPv6. I find the proximity of 279, 280, and 281 curious.

:dhouse

Comment 8

•

6 years ago

(In reply to Mark Cornmesser [:markco] from comment #7)
> Node 281 (Windows/Win vlan) also hit an issue on pxe boot. It attempts to
> boot over IPv4, fails, and then continuously retries over IPv6. I find the
> proximity of 279, 280, and 281 curious.

+1 I'll try a network boot on the other linux nodes on moon-chassis-7 (the linux queue is empty, so I'm not concerned about it backing up if I pull a few workers out):

t-linux64-ms-{271..280}.test.releng.mdc1.mozilla.com
c1n1..c10n1
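
For reference, the brace range above can be expanded into a host list with a short script. This is only a sketch: the helper name `moonshot_hosts` is hypothetical, and pairing the hosts one-to-one with cartridges c1n1..c10n1 is an assumption read off the two lines above, not something confirmed elsewhere in this bug.

```python
def moonshot_hosts(first, last, domain="test.releng.mdc1.mozilla.com"):
    """Expand t-linux64-ms-{first..last} into fully qualified host names."""
    # Host numbers are zero-padded to three digits (for example, ms-005).
    return [f"t-linux64-ms-{n:03d}.{domain}" for n in range(first, last + 1)]


def cartridge_map(hosts):
    """Pair each host with a cartridge name c1n1, c2n1, ... (assumed ordering)."""
    return {f"c{i}n1": host for i, host in enumerate(hosts, start=1)}


if __name__ == "__main__":
    for cartridge, host in cartridge_map(moonshot_hosts(271, 280)).items():
        print(cartridge, host)
```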

:dhouse

Comment 9

•

6 years ago

(In reply to Dave House [:dhouse] from comment #8)
> +1 I'll try a network boot on the other linux nodes on moon-chassis-7 (the
> linux queue is empty. so I'm not concerned about it backing-up if I pull a
> few workers out):
> 
> t-linux64-ms-{271..280}.test.releng.mdc1.mozilla.com
> c1n1..c10n1

t-linux64-ms-271 has the same failure: it times out on IPv4 pxeboot.
```
>> Booting Embedded LOM 1 Port 1 : Mellanox Network Adapter - NIC (PXE IPv4) 

>> Booting PXE over IPv4.
  Station IP address is 10.49.58.92
```

I'm also testing the others ({272..278}).

Maybe something like the hp uefi boot setting is not set for this chassis or set of machines.

:dhouse

Comment 10

•

6 years ago

```
>> Booting Embedded LOM 1 Port 1 : Mellanox Network Adapter - NIC (PXE IPv4) 

>> Booting PXE over IPv4.
  Station IP address is 10.49.58.92

  Server IP address is 10.48.75.31
  NBP filename is /bootx64.efi
  NBP filesize is 0 Bytes
  PXE-E18: Server response timeout.
```
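
When comparing many nodes, it may help to pull the relevant fields out of console captures like the one above instead of reading them by eye. The following is a minimal sketch, assuming the firmware text always has the shape shown in this bug; `parse_pxe_console` and the field names are hypothetical and not part of any tool mentioned here.

```python
import re

# Patterns for the lines seen in the console captures in this bug.
PXE_FIELDS = {
    "station_ip": re.compile(r"Station IP address is (\S+)"),
    "server_ip": re.compile(r"Server IP address is (\S+)"),
    "nbp_filename": re.compile(r"NBP filename is (\S+)"),
    "error_code": re.compile(r"(PXE-E\d+):"),
}


def parse_pxe_console(text):
    """Return a dict of the fields found in the text; missing fields are None."""
    result = {}
    for name, pattern in PXE_FIELDS.items():
        match = pattern.search(text)
        result[name] = match.group(1) if match else None
    return result


# Sample copied from the capture above.
SAMPLE = """\
>> Booting PXE over IPv4.
  Station IP address is 10.49.58.92

  Server IP address is 10.48.75.31
  NBP filename is /bootx64.efi
  NBP filesize is 0 Bytes
  PXE-E18: Server response timeout.
"""

if __name__ == "__main__":
    print(parse_pxe_console(SAMPLE))
```

In this capture, DHCP succeeded (the station has an address and a boot file name), and the timeout happens only when fetching the file from the server at 10.48.75.31.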

:dhouse

Comment 11

•

6 years ago

(In reply to Dave House [:dhouse] from comment #9)
> I'm testing the others also {272..278}.

t-linux64-ms-{271..280} all have this problem.

:dhouse

Comment 12

•

6 years ago

(In reply to Dave House [:dhouse] from comment #10)
> ```
> >> Booting Embedded LOM 1 Port 1 : Mellanox Network Adapter - NIC (PXE IPv4) 
> 
> >> Booting PXE over IPv4.
>   Station IP address is 10.49.58.92
> 
>   Server IP address is 10.48.75.31
>   NBP filename is /bootx64.efi
>   NBP filesize is 0 Bytes
>   PXE-E18: Server response timeout.
> ```

Rob, could you verify that the hp uefi filter and other dhcp options are set correctly in mdc1 for the moon-chassis-7 hosts? (I found the hp uefi filter in infoblox, but I was not able to find where it is turned on.)
10.49.58.91 - 10.49.58.100 (t-linux64-ms-{271..280})
10.49.40.182 (t-w1064-ms-281.wintest.releng.mdc1.mozilla.com)

We are seeing timeouts on pxeboot for these (for both linux and windows).

The output above is what I see for the linux machines at pxeboot, and here is the output for windows on 10.49.40.182:
```
>> Booting Embedded LOM 1 Port 1 : Mellanox Network Adapter - NIC (PXE IPv4) 

>> Booting PXE over IPv4.
  PXE-E18: Server response timeout.

>> Booting Embedded LOM 1 Port 1 : Mellanox Network Adapter - NIC (PXE IPv6) 

>> Booting PXE over IPv6
  PXE-E21: Remote boot cancelled.
```

Flags: needinfo?(rtucker)

:dhouse

Comment 13

•

6 years ago

Well shoot, I spot-checked nodes on moon-chassis-1 and moon-chassis-6, and I see the same pxe timeout (I verified that I get the same result from both the ilo ssh and java interfaces):
```
>> Booting Embedded LOM 1 Port 1 : Mellanox Network Adapter - NIC (PXE IPv4) 

>> Booting PXE over IPv4.
  Station IP address is 10.49.58.76

  Server IP address is 10.48.75.31
  NBP filename is /bootx64.efi
  NBP filesize is 0 Bytes
  PXE-E18: Server response timeout.
```

I tried cold-boot and a restart, and get the same pxe timeout for both.

I know that we successfully uefi/pxebooted these earlier this week (CIDuty and I have reimaged about 8 linux moonshots in mdc1). So maybe there was a change in the pxe/tftp server or network.
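
One way to separate a tftp server problem from a firmware problem is to request the boot file directly from another host on the same network. The sketch below assumes the standard TFTP read request from RFC 1350 on UDP port 69; the function names are hypothetical, and any reply at all (data or an error packet) is treated as "the server answered".

```python
import socket
import struct


def build_tftp_rrq(filename, mode="octet"):
    """Build an RFC 1350 read request: opcode 1, filename, NUL, mode, NUL."""
    return (
        struct.pack("!H", 1)
        + filename.encode("ascii") + b"\0"
        + mode.encode("ascii") + b"\0"
    )


def tftp_responds(server, filename, timeout=5.0):
    """Return True if the server sends any reply to a read request."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_tftp_rrq(filename), (server, 69))
        try:
            # A DATA packet is at most 4 header bytes plus 512 payload bytes.
            sock.recvfrom(516)
        except OSError:
            # Covers the timeout as well as unreachable-host errors.
            return False
        return True


if __name__ == "__main__":
    # Address and file name taken from the console output in this bug.
    print(tftp_responds("10.48.75.31", "/bootx64.efi"))
```

A `False` result from a host that can normally reach the server would point at the tftp service or the network path rather than at the moonshot cartridges.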

Rob Tucker [:rtucker]

Comment 14

•

6 years ago

I haven't made any changes.

Is there anything specific you want me to confirm?

Flags: needinfo?(rtucker)

:dhouse

Comment 15

•

6 years ago

(In reply to Rob Tucker [:rtucker] from comment #14)
> I haven't made any changes.
> 
> Is there anything specific you want me to confirm?

Could you show me where the hp uefi boot filter is set on the releng network in infoblox mdc1, and what other dhcp options are set for test.releng.mdc1? I tried to find them, but I only found the definition for the hp uefi filter and so I think I'm not checking the correct place.

Flags: needinfo?(rtucker)

Rob Tucker [:rtucker]

Comment 16

•

6 years ago

The filter "HP - UEFI Clients 00007" is set correctly on 10.51.56.0/22 and 10.49.56.0/22 hence getting the filename of /bootx64.efi. You wouldn't get /bootx64.efi without the filter.

You can view the options by clicking the settings wheel next to the network in the IPAM browser and looking at the IPv4 DHCP Options.

Is 10.48.75.31 the proper tftp server? This is the most likely issue, NOT the filter.
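
As a quick cross-check of which station addresses fall inside the two networks named above, here is a small sketch using Python's standard `ipaddress` module. The helper name `covered_by_filter` is hypothetical, and the network list contains only the two /22 ranges mentioned in this comment; other networks may carry the same filter without being listed in this bug.

```python
import ipaddress

# The two networks named in this comment as having the HP UEFI filter.
FILTERED_NETWORKS = [
    ipaddress.ip_network("10.51.56.0/22"),
    ipaddress.ip_network("10.49.56.0/22"),
]


def covered_by_filter(address):
    """Return True if the address is inside one of the listed networks."""
    ip = ipaddress.ip_address(address)
    return any(ip in network for network in FILTERED_NETWORKS)


if __name__ == "__main__":
    for address in ("10.49.58.92", "10.49.58.100", "10.49.40.182"):
        print(address, covered_by_filter(address))
```

The linux station addresses (10.49.58.x) are inside 10.49.56.0/22, which is consistent with them receiving /bootx64.efi. The windows host at 10.49.40.182 is outside both listed ranges, so its DHCP options would have to be checked on its own network.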

Flags: needinfo?(rtucker)

:dhouse

Comment 17

•

6 years ago

(In reply to Rob Tucker [:rtucker] from comment #16)
> The filter "HP - UEFI Clients 00007" is set correctly on 10.51.56.0/22 and
> 10.49.56.0/22 hence getting the filename of /bootx64.efi. You wouldn't get
> /bootx64.efi without the filter.
> 
> You can view the options by clicking the settings wheel next to the network
> in the IPAM browser and looking at the IPv4 DHCP Options
> 
> Is 10.48.75.31 the proper tftp server? This is the most likely issue, NOT
> the filter.

Thank you. Following your directions, I see the dhcp options.

I'll change the next-server to match the change made for the other use of that admin server in bug 1354300.

:dhouse

Comment 18

•

6 years ago

I've changed the tftp-server for releng.mdc1 and releng.mdc2, in bug https://bugzilla.mozilla.org/show_bug.cgi?id=1464493. I don't see the change yet when I test rebooting t-linux64-ms-280. I'll try it again in the morning (maybe it takes some time to apply).

Attila Craciun [:arny]

Comment 19

•

6 years ago

Just a note, t-linux64-ms-280 is a loaner for :dragrom

:dhouse

Comment 20

•

6 years ago

(In reply to Attila Craciun [:arny] from comment #19)
> Just a note, t-linux64-ms-280 is a loaner for :dragrom

:arny could you link the loaner bug to this bug?

:dhouse

Comment 21

•

6 years ago

I tested pxeboot again this morning, on t-linux64-ms-005 as it was not running a task, and it still gets the old tftp server:
```
>> Booting Embedded LOM 1 Port 1 : Mellanox Network Adapter - NIC (PXE IPv4) 

>> Booting PXE over IPv4.
  Station IP address is 10.49.58.5

  Server IP address is 10.48.75.31
  NBP filename is /bootx64.efi
  NBP filesize is 0 Bytes
  PXE-E18: Server response timeout.
```

Attila Craciun [:arny]

Comment 22

•

6 years ago

(In reply to Dave House [:dhouse] from comment #20)
> (In reply to Attila Craciun [:arny] from comment #19)
> > Just a note, t-linux64-ms-280 is a loaner for :dragrom
> 
> :arny could you link the loaner bug to this bug?

It is already set: Bug 1410207.

Attila Craciun [:arny]

Comment 23

•

6 years ago

Attached image Screenshot from 2018-05-30 17-19-49.png — Details

Dave, PXE works on linux-ms-193, but it is centos 7.

Attila Craciun [:arny]

Comment 24

•

6 years ago

Attached image Screenshot from 2018-05-30 17-20-58.png — Details

Rob Tucker [:rtucker]

Comment 25

•

6 years ago

Worked with :dhouse via IRC and updated the next-server options.

:dhouse

Comment 26

•

6 years ago

> (In reply to Dave House [:dhouse] from comment #20)
> > (In reply to Attila Craciun [:arny] from comment #19)
> > > Just a note, t-linux64-ms-280 is a loaner for :dragrom
> > 
> > :arny could you link the loaner bug to this bug?
> 
> It is already set Bug 1410207.

Ok, since that is resolved, I'll reimage 280 (now that the pxeboot is fixed) to put it back into service.

Attila Craciun [:arny]

Comment 27

•

6 years ago

Do not re-image 280; :dragrom still needs it as a loaner :).

Attila Craciun [:arny]

Comment 28

•

6 years ago

Attached image Screenshot from 2018-05-31 09-43-47.png — Details

PXE works now; however, the Ubuntu image is broken. Tested on 001 and 193: same message whether or not I add the PUPPET_PASS option.

:dhouse

Comment 29

•

6 years ago

Dragos needs 280 as a loaner for the next two weeks to test puppet changes for bug 1465309.

:dhouse

Comment 30

•

6 years ago

(In reply to Attila Craciun [:arny] from comment #28)
> Created attachment 8982132 [details]
> Screenshot from 2018-05-31 09-43-47.png
> 
> PXE works now, however, the Ubuntu image is broken. Tested on 001 and 193,
> same message even if I add the PUPPET_PASS option or not.

The pxe/netboot reimaging process is now fixed. So when we need to reimage this one it will work.

Bogdan Crisan [:bcrisan] (EEST - GMT + 3)

Comment 31

•

6 years ago

(In reply to Attila Craciun [:arny] from comment #19)
> Just a note, t-linux64-ms-280 is a loaner for :dragrom

Apart from working on that machine and changing syslog to use TCP instead of UDP, can you confirm that you have this machine as a loan?

Flags: needinfo?(dcrisan)

Dragos Crisan [:dragrom]

Assignee

Comment 32

•

6 years ago

Yes, I'll need this machine as a loaner to test changes for bug 1465309.

Flags: needinfo?(dcrisan)

Zsolt Fay [:zfay]

Reporter

Comment 33

•

6 years ago

Should we leave this bug open? @dragrom, let us know when you are done with the machine.

:dhouse

Comment 34

•

6 years ago

:dragrom, are you done with the loaner t-linux64-ms-280?

Alias: t-linux64-ms-280

Flags: needinfo?(dcrisan)

Summary: t-linux64-ms-280 problem tracking → t-linux64-ms-280.test.releng.mdc1.mozilla.com. problem tracking

Dragos Crisan [:dragrom]

Assignee

Comment 35

•

6 years ago

We can consider this loaner a staging worker, like t-yosemite-r7-380. In my opinion, we can close this bug.

Flags: needinfo?(dcrisan)

:dhouse

Comment 36

•

6 years ago

Ok, let's keep this bug open as a marker that this machine is not in production (until there is a better way to track the different types).

:dhouse

Updated

•

6 years ago

Blocks: 1464064

No longer depends on: 1464064

:dhouse

Comment 37

•

6 years ago

Puppet failures repeated today on this machine, so I stopped it (powered off through ilo).

Assignee: nobody → dcrisan

Flags: needinfo?(dcrisan)

Dragos Crisan [:dragrom]

Assignee

Comment 38

•

6 years ago

This is the error:

Thu Jul 05 12:20:23 -0700 2018 Puppet (err): Could not delete user cltbld: Execution of '/usr/sbin/userdel cltbld' returned 8: userdel: user cltbld is currently used by process 2882
Thu Jul 05 12:20:23 -0700 2018 /Stage[main]/Main/User[cltbld]/ensure (err): change from present to absent failed: Could not delete user cltbld: Execution of '/usr/sbin/userdel cltbld' returned 8: userdel: user cltbld is currently used by process 2882

This error is caused by the following definition from the puppet master:
node 't-linux64-ms-280.test.releng.mdc1.mozilla.com' {
    $aspects = [ 'low-security' ]
    include toplevel::server
}

After landing the patch from Bug 1473281, this error will disappear.
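
To see which process is holding the account when this error shows up in the logs, the user name and process id can be extracted from the puppet message. This is a rough sketch that assumes the message format shown above; `blocking_process` is a hypothetical helper.

```python
import re

# Matches the userdel message embedded in the puppet error above.
USERDEL_BUSY = re.compile(r"userdel: user (\S+) is currently used by process (\d+)")


def blocking_process(log_line):
    """Return (user, pid) from a userdel 'currently used' message, or None."""
    match = USERDEL_BUSY.search(log_line)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


# Sample copied from the error above.
SAMPLE = (
    "Thu Jul 05 12:20:23 -0700 2018 Puppet (err): Could not delete user cltbld: "
    "Execution of '/usr/sbin/userdel cltbld' returned 8: "
    "userdel: user cltbld is currently used by process 2882"
)

if __name__ == "__main__":
    print(blocking_process(SAMPLE))
```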

Flags: needinfo?(dcrisan)

:dhouse

Comment 39

•

6 years ago

I kicked off the reimage this morning, but I never saw a report from puppet. I may have typo'd the kickstart password. I'm retrying the reimage now.

:dhouse

Comment 40

•

6 years ago

(In reply to Dave House [:dhouse] from comment #39)
> I kicked off the reimage this morning, but I never saw a report from puppet.
> I may have typo'd the kickstart password. i'm re-trying reimaging it now.

I checked through ilo and found a kernel panic (not able to mount fs) logged to the screen (from the earlier reimage).

Bogdan Crisan [:bcrisan] (EEST - GMT + 3)

Updated

•

6 years ago

Summary: t-linux64-ms-280.test.releng.mdc1.mozilla.com. problem tracking → [MDC1] t-linux64-ms-280 problem tracking

:dhouse

Comment 41

•

6 years ago

(In reply to Dave House [:dhouse] from comment #40)
> I checked through ilo and found a kernel panic (not able to mount fs) logged
> to the screen (from the earlier reimage).

I watched while retrying the reimage, and it hit a timeout on the initrd.gz download for initial setup. When I tried again to capture a log through ilo+ssh, it did not time out, and it entered the ubuntu setup correctly.

:dhouse

Comment 42

•

6 years ago

(In reply to Dave House [:dhouse] from comment #41)
> I watched on retrying the reimage and it hit a timeout on the initrd.gz
> download for initial setup. On trying again to capture a log through
> ilo+ssh, it did not timeout and so it entered the ubuntu setup correctly.

https://papertrailapp.com/systems/1645518191/events
280 came up correctly from reinstall+puppetize.

:dhouse

Comment 43

•

6 years ago

There were repeated puppet failures on this machine today. I've powered it off.
Dragos, please power it back on if you need to test on it tomorrow.

Flags: needinfo?(dcrisan)

Danut Labici [:dlabici]

Comment 44

•

6 years ago

Raising the priority of this bug, as it's a known machine with problems.
When the machine is fixed, feel free to remove the P1 from the bug.

Priority: -- → P1

Dragos Crisan [:dragrom]

Assignee

Comment 45

•

6 years ago

The errors were generated by my tests to install generic-worker on Linux. Puppet is now fixed in my environment. Please restart the machine.

Flags: needinfo?(dcrisan)

:dhouse

Comment 46

•

6 years ago

I've powered it back on.

Priority: P1 → --

Dragos Crisan [:dragrom]

Assignee

Comment 47

•

6 years ago

This machine is now part of the linux-talos staging pool: https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos-b

Dragos Crisan [:dragrom]

Assignee

Comment 48

•

6 years ago

I'll close the bug since this machine is now part of the staging pool.

Status: NEW → RESOLVED

Closed: 6 years ago

Resolution: --- → FIXED

BMO Automation

Updated

•

4 years ago

Product: Infrastructure & Operations → Infrastructure & Operations Graveyard
