Closed Bug 1409349 Opened 7 years ago Closed 7 years ago

More machines from t-w1064, t-w864, t-w732 and t-yosemite pools are unreachable

Categories

(Infrastructure & Operations :: DCOps, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: aobreja, Assigned: van)

References

Details

Attachments

(1 file)

Please check the list below; all these machines are unreachable:

-t-yosemite-r7-0229
-t-w1064-ix-139
-t-w1064-ix-138
-t-yosemite-r7-0279
-t-yosemite-r7-0225 
-t-yosemite-r7-0225 
-t-yosemite-r7-0137
-t-yosemite-r7-0130
-t-yosemite-r7-0068
-t-yosemite-r7-0048
-t-yosemite-r7-0045
-t-w1064-ix-117 
-t-w1064-ix-312
-t-w1064-ix-313
-t-w732-ix-131
-t-w864-ix-037
-t-w732-ix-130
-t-w732-ix-107 
-t-w732-ix-120
-t-w732-ix-105
-t-w732-ix-096
-t-w732-ix-081
-t-w732-ix-047 
-t-w732-ix-056
-t-w732-ix-054
-t-w732-ix-041
-t-w732-ix-037
-t-w732-ix-011 
-t-w732-ix-031
-t-w732-ix-033
-t-w732-ix-030
-t-w732-ix-022
-t-w732-ix-016

>Attempting SSH reboot...Failed.
>Attempting IPMI reboot...Failed.
>Machine is unreachable, manual intervention required
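For reference, a rough Python sketch of the escalation that produces the output quoted above; the ssh_reboot/ipmi_reboot helpers, credentials, and the "-mgmt" naming are illustrative assumptions, not the actual buildduty tooling:

# Hypothetical sketch of the SSH -> IPMI -> manual-intervention escalation.
# ssh_reboot/ipmi_reboot are stand-ins, not real buildduty tool names.
import subprocess

def ssh_reboot(host):
    # Ask the host to reboot itself; a non-zero exit means we never got in.
    cmd = ["ssh", "-o", "ConnectTimeout=10", host, "shutdown -r now"]
    return subprocess.call(cmd) == 0

def ipmi_reboot(host):
    # Fall back to power-cycling via the management interface
    # (the "-mgmt" suffix and credentials are assumed for illustration).
    cmd = ["ipmitool", "-H", host + "-mgmt", "-U", "admin", "-P", "password",
           "chassis", "power", "cycle"]
    return subprocess.call(cmd) == 0

for host in ["t-yosemite-r7-0229", "t-w1064-ix-139"]:  # etc.
    if ssh_reboot(host):
        continue
    print("Attempting SSH reboot...Failed.")
    if ipmi_reboot(host):
        continue
    print("Attempting IPMI reboot...Failed.")
    print("Machine is unreachable, manual intervention required")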
Also t-yosemite-r7-402
Also the list below:

-t-w732-ix-104 
-t-w732-ix-027
-t-w732-ix-003 
-t-w732-ix-020
-t-w732-ix-024
-t-w732-ix-040
-t-w732-ix-073 
-t-w732-ix-087
-t-w732-ix-086
-t-w732-ix-091
-t-w732-ix-098
-t-w732-ix-065
-t-w732-ix-106
-t-w732-ix-122
-t-w732-ix-141
-t-w732-ix-111
Also:

-t-yosemite-r7-0110  
-t-yosemite-r7-0108
fubar, it was proposed by arr that we stop filing dcops bugs for machines that are no longer reachable (buildduty can't recover them). Do you have context here? One of the primary concerns is to keep low pool counts from getting lower, e.g. yosemite and xp, the reason being that they are our latest pending bottlenecks [1]. Perhaps we should be making exceptions? Any thoughts from you or anyone watching this component?

[1] https://bugzilla.mozilla.org/show_bug.cgi?id=1409439
Flags: needinfo?(klibby)
I don't have context, unfortunately. I'd like to understand why machines are going unreachable; it sounds like something is very broken.

OTOH, I expect that response to this bug, at the very least, will be delayed by :van being in MDC1 setting up the minis from move train #2.
Flags: needinfo?(klibby)
okay, thanks. It usually is a sign they are very broken, yes. Unfortunately, though, this is a normal and essential escalation step in recovering the hardware machines we have.

Thank you for letting us know about expected delays; buildduty can act appropriately. If anyone with context on dcops and the data center migration could give us an estimated timeline for actioning this bug, that would be great.
oh wow, there are about 30 machines just in this bug. i'll be on site today and perhaps tomorrow to catch up on these bugs.
Assignee: server-ops-dcops → vle
Thank you Van! If you need anything tested or confirmed on the systems or in the logs after boot/image on these, please ping me. I'd like to help if I can.
:van, can you put a spider kvm on one of those yosemite hosts for me so I can try to determine why its network is going unresponsive? (Assuming they aren't completely hung up.)
:dividehex, i've attached a spider to one of the minis - 10.26.52.254.

going through the list but so far all the minis i've come across are running into the local fw issue. i'm reimaging them and will have a more comprehensive update.

in comment 0, you have -t-yosemite-r7-0225 listed twice; did you mean another host is down?
See Also: → 1401601
:dividehex/markco, it looks like the win7 testers are also running into an APIPA issue. i've had to reimage quite a few today and will continue troubleshooting the rest of the win7 nodes tomorrow. do you want me to attach a kvm or are you able to use IPMI's console redirection?

Decommission:
-t-w1064-ix-138 - these 4 are same chassis, bad backplane, out of warranty
-t-w1064-ix-139 - these 4 are same chassis, bad backplane, out of warranty
-t-w1064-ix-312 - these 4 are same chassis, bad backplane, out of warranty
-t-w1064-ix-313 - these 4 are same chassis, bad backplane, out of warranty

back online:
-t-yosemite-r7-0229 - local fw issue, attached spider kvm for troubleshooting
-t-yosemite-r7-0279 - local fw issue, reimaged
-t-yosemite-r7-0225 - local fw issue, reimaged
-t-yosemite-r7-0137 - local fw issue, reimaged
-t-yosemite-r7-0130 - local fw issue, reimaged
-t-yosemite-r7-0068 - local fw issue, reimaged
-t-yosemite-r7-0048 - local fw issue, reimaged
-t-yosemite-r7-0045 - local fw issue, reimaged
-t-yosemite-r7-0110 - local fw issue, reimaged
-t-yosemite-r7-0108 - local fw issue, reimaged
-t-w1064-ix-117 - back online
-t-w732-ix-131 - private IP addressing, reimaged
-t-w864-ix-037 - back online
-t-w732-ix-130 - private IP addressing, reimaged

pending:
-t-yosemite-r7-402 - MDC1 node, tracked in bug 1409281, opened QTS REQ0194461


still need to troubleshoot the following nodes tomorrow:
-t-w732-ix-107 
-t-w732-ix-120
-t-w732-ix-105
-t-w732-ix-096
-t-w732-ix-081
-t-w732-ix-047 
-t-w732-ix-056
-t-w732-ix-054
-t-w732-ix-041
-t-w732-ix-037
-t-w732-ix-011 
-t-w732-ix-031
-t-w732-ix-033
-t-w732-ix-030
-t-w732-ix-022
-t-w732-ix-016
-t-w732-ix-104 
-t-w732-ix-027
-t-w732-ix-003 
-t-w732-ix-020
-t-w732-ix-024
-t-w732-ix-040
-t-w732-ix-073 
-t-w732-ix-087
-t-w732-ix-086
-t-w732-ix-091
-t-w732-ix-098
-t-w732-ix-065
-t-w732-ix-106
-t-w732-ix-122
-t-w732-ix-141
-t-w732-ix-111
Flags: needinfo?(jwatkins)
Flags: needinfo?(mcornmesser)
Van: Can you change the BIOS graphics card priority on 3 of the w732 machines? Once that is done I can connect through IPMI and do some troubleshooting.
Flags: needinfo?(mcornmesser)
Van, could you check the minis #219 and #120 in scl3, and #444 in mdc1? I think the duplicate entry for t-yosemite-r7-225 was meant for t-yosemite-r7-219.

SCL3:
https://nagios1.private.releng.scl3.mozilla.com/releng-scl3/cgi-bin/status.cgi?hostgroup=t-yosemite-r7-machines&style=detail&limit=100&sorttype=1&sortoption=6&sorttype=2&sortoption=6
Shows these three as down for 10h+:
(expected, on kvm) t-yosemite-r7-0229.test.releng.scl3.mozilla.com  20d 22h 6m 52s 	30/30 	PING CRITICAL - Packet loss = 100% 
t-yosemite-r7-0219.test.releng.scl3.mozilla.com  11d 1h 5m 10s 	30/30 	PING CRITICAL - Packet loss = 100% 
t-yosemite-r7-0120.test.releng.scl3.mozilla.com  0d 15h 0m 48s 	30/30 	PING CRITICAL - Packet loss = 100% 

MDC1:
https://nagios1.private.releng.mdc1.mozilla.com/releng-mdc1/cgi-bin/status.cgi?hostgroup=t-yosemite-r7-machines&style=detail&limit=100&sorttype=1&sortoption=6&sorttype=2&sortoption=6
t-yosemite-r7-444.test.releng.mdc1.mozilla.com  2d 5h 22m 19s 	3/3 	PING CRITICAL - Packet loss = 100% 
(loaner, I think expected) t-yosemite-r7-393.test.releng.mdc1.mozilla.com

These are the mac minis that were down yesterday:
12 minis were non-responsive. Nagios showed 11 with 2 to 19 days unresponsive (plus the one in mdc1)
t-yosemite-r7-0045 reported 10/17. hw reboot fixed 08/02, 07/21, 06/30, 02/13, 2016: 12/21(found off), 09/26, 07/13, 07/05
t-yosemite-r7-0048 reported 10/17
t-yosemite-r7-0068 10/17. hardware reboot fixed 08/08, 08/02, 06/06, 04/24, 02/27, 02/22
t-yosemite-r7-0108 10/17. hardware reboot fixed 08/07, 08/02
t-yosemite-r7-0110 10/17
t-yosemite-r7-0130 10/17. reimage fixed 09/26 (host-based firewall?), hw reboot fixed 08/07, 03/06, 02/08
t-yosemite-r7-0137 10/17. earlier reports cancelled 06/26, 12/22
t-yosemite-r7-0219 10/17 (linked as a bug but not listed in a comment)
t-yosemite-r7-0225 10/17. reimage fixed 02/06
t-yosemite-r7-0229 10/17. reimage fixed 09/28 (host-based firewall?), reimaged 09/07, cancelled 08/10, hw reboot fixed 08/02, 02/22, 01/17, 2016: 12/15, 11/28, hw reboot fixed 05/18, loaner 05/12-05/18
t-yosemite-r7-0279 10/17. hardware reboot fixed 04/24
t-yosemite-r7-402 reported 10/16 (found working)
t-yosemite-r7-444 reported 10/18 (linked as a bug but not listed in a comment)
Flags: needinfo?(vle)
:dhouse, can you open a separate bug for MDC1 minis? it makes it much easier to track since they are different data centers. i'm on site and will continue troubleshooting.
Flags: needinfo?(vle)
(In reply to Van Le [:van] from comment #13)
> pending:
> -t-yosemite-r7-402 - MDC1 node, tracked in bug 1409281, opened QTS REQ0194461
> 

Thx Van! The only problem MDC1 mini is covered already in bug https://bugzilla.mozilla.org/show_bug.cgi?id=1409743 with QTS REQ0194461

So that doesn't need to be handled in this bug.
took care of all the yosemite and w7 hosts left in the bug. can you open new bugs for any future bad hosts? the "one bug to track them all" approach makes it hard to chase down/look up their past issues.

:markco, i changed the resolution on 3 hosts as requested.

t-yosemite-r7-0219 - local fw issue, reimaged
t-yosemite-r7-0120 - local fw issue, reimaged

t-w732-ix-107 - APIPA, rebooted and came back online
t-w732-ix-120 - APIPA, rebooted with no luck, changed video output for mark to troubleshoot
t-w732-ix-105 - APIPA, rebooted with no luck, changed video output for mark to troubleshoot
t-w732-ix-096 - APIPA, rebooted with no luck, changed video output for mark to troubleshoot
t-w732-ix-081 - APIPA, rebooted with no luck, reimaged
t-w732-ix-047 - APIPA, rebooted with no luck, reimaged
t-w732-ix-056 - APIPA, rebooted with no luck, reimaged
t-w732-ix-054 - APIPA, rebooted with no luck, reimaged
t-w732-ix-041 - bad drive, reimaged
t-w732-ix-037 - APIPA, rebooted with no luck, reimaged 
t-w732-ix-011 - APIPA, rebooted with no luck, reimaged 
t-w732-ix-031 - APIPA, rebooted with no luck, reimaged
t-w732-ix-033 - APIPA, rebooted with no luck, reimaged
t-w732-ix-030 - APIPA, rebooted with no luck, reimaged
t-w732-ix-022 - bad drive, reimaged
t-w732-ix-016 - APIPA, rebooted with no luck, reimaged
t-w732-ix-104 - APIPA, rebooted with no luck, reimaged 
t-w732-ix-027 - APIPA, rebooted with no luck, reimaged
t-w732-ix-003 - APIPA, rebooted with no luck, reimaged 
t-w732-ix-020 - APIPA, rebooted with no luck, reimaged
t-w732-ix-024 - APIPA, rebooted with no luck, reimaged
t-w732-ix-040 - APIPA, rebooted with no luck, reimaged
t-w732-ix-073 - APIPA, rebooted with no luck, reimaged 
t-w732-ix-087 - APIPA, rebooted with no luck, reimaged
t-w732-ix-086 - APIPA, rebooted with no luck, reimaged
t-w732-ix-091 - bad drive, reimaged
t-w732-ix-098 - APIPA, rebooted with no luck, reimaged
t-w732-ix-065 - APIPA, rebooted with no luck, reimaged
t-w732-ix-106 - APIPA, rebooted with no luck, reimaged
t-w732-ix-122 - APIPA, rebooted with no luck, reimaged
t-w732-ix-141 - APIPA, rebooted with no luck, reimaged
t-w732-ix-111 - bad drive, reimaged
Flags: needinfo?(mcornmesser)
to note, the bad drives in the previous comment were swapped with drives from machines we decommissioned (same manufacturer/model). all of these machines are out of warranty, and warranty coverage on the drives themselves is hit or miss with the manufacturer.
I took a look at t-w732-ix-120. I suspect that the machines weren't able to communicate on boot, and that is what caused the issue. They were probably looping back to themselves for DNS as well, which is why a reboot did not fix it.
Flags: needinfo?(mcornmesser)
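For anyone else hitting this, a rough stdlib-only Python sketch (run on the affected test machine itself) that simply reports whether any local address has fallen back to APIPA, i.e. the DHCP exchange never completed; using a hostname lookup to enumerate local addresses is an illustrative assumption, not part of our imaging or GPO tooling:

# Flag any local address in the APIPA range (169.254.0.0/16).
import ipaddress
import socket

APIPA = ipaddress.ip_network("169.254.0.0/16")

addrs = {info[4][0] for info in
         socket.getaddrinfo(socket.gethostname(), None, socket.AF_INET)}
for addr in sorted(addrs):
    tag = "APIPA - DHCP never completed" if ipaddress.ip_address(addr) in APIPA else "ok"
    print(addr + ": " + tag)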
:markco, why is this happening in the first place? are we able to fix the issue or is reimaging going to be the fix for the APIPA issues?
Flags: needinfo?(mcornmesser)
(In reply to Van Le [:van] from comment #21)
> :markco, why is this happening in the first place? are we able to fix the
> issue or is reimaging going to be the fix for the APIPA issues?

I will dive through the logs on the local machine tomorrow to see if I can find any kind of root cause.

Once a machine reaches this state, reimaging will be the most straightforward way to get it back into the pool.
Flags: needinfo?(mcornmesser)
Depends on: 1411094
I'm going to go out on a limb here and make an assumption, since I don't really have the time capacity to pin down the cause. My best guess regarding the APIPA issue is that somehow the local firewall is interfering with the DHCP exchange. I'm not sure if it stems from the connection state table failing to keep track of the outgoing packets (such as a timeout or excessive delay between responses), or whether it doesn't handle a new (non-renewal) lease properly, since there is no IP assigned yet and the returning packet comes from a unicast IP while the outgoing packet recorded in the state table was sent to broadcast.

I'm just not sure.

Anyway, if we make this assumption, we should be able to work around it with a more permissive firewall rule that allows DHCP regardless of state tracking.

This also applies to bug 1401601.

markco, Q: you will need to make a rule under the GPO to do the same. Basically, on ingress, allow UDP from any IP with source port 67 to any IP on destination port 68.
Flags: needinfo?(jwatkins)
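For reference, a sketch of how that rule could be expressed on the Windows side; the rule name and applying it via netsh from a Python one-off are my assumptions about how it might be pushed out, not the actual GPO change:

# Hypothetical: add a permissive inbound rule allowing DHCP server replies
# (UDP source port 67 -> local client port 68) regardless of connection state.
# Must run elevated; the rule name below is made up.
import subprocess

subprocess.check_call([
    "netsh", "advfirewall", "firewall", "add", "rule",
    "name=Allow-DHCP-permissive",
    "dir=in", "action=allow", "protocol=UDP",
    "localip=any", "remoteip=any",
    "localport=68",   # DHCP client port on the test machine
    "remoteport=67",  # DHCP server source port
])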
Attachment #8921280 - Flags: review?(dhouse)
Attachment #8921280 - Flags: review?(dhouse) → review+
See Also: → 1413980
closing out the bug; please reopen if you need any additional hands-on help with any of the 50ish hosts listed in this bug.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
See Also: → 1419690
See Also: → t-w1064-ix-312
See Also: → 1419693
See Also: → t-w1064-ix-139