Closed Bug 1464064 Opened 6 years ago Closed 6 years ago

Moonshot Linux nodes stop functioning

Categories

(Infrastructure & Operations :: RelOps: General, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: arny, Assigned: dhouse)

References

Details

We have found that the below Linux servers are not visible in TC and are not taking jobs. We will re-image them and update this bug.

t-linux64-ms-193
t-linux64-ms-279
t-linux64-ms-280
t-linux64-ms-484
t-linux64-ms-495
t-linux64-ms-527
t-linux64-ms-580
Can we get a link to the papertrail logs for some of these? This sounds suspiciously like what we're seeing on the w10 nodes, where they suddenly stop working.
Blocks: 1464073
Blocks: 1464080
Good results for t-linux64-ms-495 (https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/workers/mdc2/t-linux64-ms-495). I think we need to do the same reimage, then watch, and possibly cold-reboot if the first boot after the reimage gets stuck.

I'll check 279 and 280 next (they get stuck at the PXE boot menu).
(In reply to Kendall Libby [:fubar] from comment #1)
> Can we get a link to the papertrail logs for some of these? This sounds
> suspiciously like what we're seeing on the w10 nodes, where they suddenly
> stop working.

Here are all of them in papertrail: https://papertrailapp.com/groups/6937292/events?q=t-linux64-ms-193%20OR%20%20t-linux64-ms-279%20OR%20%20t-linux64-ms-280%20OR%20%20t-linux64-ms-484%20OR%20%20t-linux64-ms-495%20OR%20%20t-linux64-ms-527%20OR%20%20t-linux64-ms-580&focus=936334690644828165

Not much for the ones that are stuck at PXE boot, however. But ones like #495 show logs once they are repaired: https://papertrailapp.com/systems/1899813261/events?focus=936334699754856501
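For ad-hoc digging, the same kind of query can be run from a terminal with the papertrail CLI gem. A hedged sketch (the group name "moonshots" is a placeholder; it assumes an API token is configured in ~/.papertrail.yml):

```
# Hedged sketch: requires the papertrail-cli gem and a Papertrail API token.
gem install papertrail
papertrail -g moonshots 't-linux64-ms-279 OR t-linux64-ms-280'
```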
There is a tracking bug for hardware issues on the moonshots:

https://bugzilla.mozilla.org/show_bug.cgi?id=1428159

If we find a hardware issue with any of these we can address it through that bug.
See Also: → 1428159
stuck on pxeboot:
t-linux64-ms-193  (after failing pxeboot, goes into xen currently)
t-linux64-ms-279  (after failing pxeboot, goes into ubuntu but without tc-worker running)
t-linux64-ms-280  (after failing pxeboot, goes into ubuntu but without tc-worker running)

fixed by reimage:
t-linux64-ms-484
t-linux64-ms-495
t-linux64-ms-527

Okay: 571-580 are not in production; they are a development set.
t-linux64-ms-580 (we expect this to be off or not running tc-worker).
t-linux64-ms-488 also came up as not running the tc worker. This one needs to be reimaged as it has a problem with its puppet certificate and cannot update its puppet config.
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/workers/mdc2/t-linux64-ms-488
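The reimage side-steps the bad certificate, but for reference, a hedged sketch of the usual Puppet 3.x certificate reset (the node FQDN and root/master access are assumptions):

```
# On the puppet master (hypothetical FQDN for the node):
puppet cert clean t-linux64-ms-488.test.releng.mdc2.mozilla.com

# On the node: drop the stale SSL state and request a fresh cert on the next run.
rm -rf /var/lib/puppet/ssl
puppet agent --test
```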
I've done the reimage on t-linux64-ms-488. It appears in TC and is taking tasks. We still cannot SSH into it.
The below Linux servers were not present in the TC list; however, I was able to see each one's tasks. I have rebooted all of them and they run tasks successfully.
 
t-linux64-ms-007
t-linux64-ms-057
t-linux64-ms-141
t-linux64-ms-183
t-linux64-ms-189
t-linux64-ms-493
(In reply to Attila Craciun [:arny] from comment #8)
> The below Linux servers were not present in the TC list; however, I was
> able to see each one's tasks. I have rebooted all of them and they run tasks
> successfully.
>  
> t-linux64-ms-007
> t-linux64-ms-057
> t-linux64-ms-141
> t-linux64-ms-183
> t-linux64-ms-189
> t-linux64-ms-493

PXE is also not working for these machines.
The below servers were not visible in TC. After checking them, all were stuck at the grub menu. I rebooted them; they now show up in TC, running and completing jobs successfully.

 t-linux64-ms-272
 t-linux64-ms-273
 t-linux64-ms-274
 t-linux64-ms-275 (need firmware upgrade bug 1464044)
 t-linux64-ms-276
 t-linux64-ms-277
(In reply to Attila Craciun [:arny] from comment #9)
> (In reply to Attila Craciun [:arny] from comment #8)
> > The below Linux servers were not present in the TC list; however, I was
> > able to see each one's tasks. I have rebooted all of them and they run tasks
> > successfully.
> >  
> > t-linux64-ms-007
> > t-linux64-ms-057
> > t-linux64-ms-141
> > t-linux64-ms-183
> > t-linux64-ms-189
> > t-linux64-ms-493
> 
> PXE is also not working for these machines.

I don't have PXE working yet for the mdc1 moonshots.
(In reply to Dave House [:dhouse] from comment #11)
> (In reply to Attila Craciun [:arny] from comment #9)
> > (In reply to Attila Craciun [:arny] from comment #8)
> > > The below Linux servers were not present in the TC list; however, I was
> > > able to see each one's tasks. I have rebooted all of them and they run tasks
> > > successfully.
> > >  
> > > t-linux64-ms-007
> > > t-linux64-ms-057
> > > t-linux64-ms-141
> > > t-linux64-ms-183
> > > t-linux64-ms-189
> > > t-linux64-ms-493
> > 
> > PXE is also not working for these machines.
> 
> I don't have PXE working yet for the mdc1 moonshots.

I changed all of the Linux nodes on the mdc1 and mdc2 moonshots to boot from their local hard disks instead of doing a PXE boot first, so machines that reboot no longer waste time trying to PXE-boot.
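For reference, the equivalent persistent boot-order change with plain ipmitool looks roughly like the sketch below. It assumes the cartridges answer standard IPMI (the actual change was made through the Moonshot chassis manager), and the host and credentials are placeholders:

```
# Set the default boot device to the local disk (persistent across reboots).
ipmitool -I lanplus -H <cartridge-bmc> -U <user> -P <pass> chassis bootdev disk options=persistent
# Verify the boot flags that were set.
ipmitool -I lanplus -H <cartridge-bmc> -U <user> -P <pass> chassis bootparam get 5
```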
As an update on linux moonshots:

t-linux64-ms-193 and t-linux64-ms-275 are out of service. This is also noted in the MS document.

t-linux64-ms-279 and t-linux64-ms-280 were missing from TC. I re-imaged them and the process completed all the way through. Machine 279 was assigned to dividehex when it was last broken, as per bug 1435020.

t-linux64-ms-394, however, won't even get through PXE boot. It looks like it's trying, but it keeps dropping back to the beginning of PXE boot.
t-linux64-ms-257 - rebooted, was not present in TC. Now it is back in business.
I see that t-linux64-ms-394 is working: https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/workers/mdc2/t-linux64-ms-394

Danut, could you have someone on your team check over the others to see if they are in the same state now or fixed?
(In reply to Dave House [:dhouse] from comment #15)
> I see that t-linux64-ms-394 is working:
> https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/
> gecko-t-linux-talos/workers/mdc2/t-linux64-ms-394
> 
> Danut, could you have someone on your team check over the others to see if
> they are in the same state now or fixed?

:dhouse, 279 and 280 are missing from TC once again. Shall we keep them in that state for further investigation, or shall we re-image them once again?
Flags: needinfo?(dhouse)
(In reply to Roland Mutter Michael (:rmutter) from comment #16)
> (In reply to Dave House [:dhouse] from comment #15)
> > I see that t-linux64-ms-394 is working:
> > https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/
> > gecko-t-linux-talos/workers/mdc2/t-linux64-ms-394
> > 
> > Danut, could you have someone on your team check over the others to see if
> > they are in the same state now or fixed?
> 
> :dhouse, 279 and 280 are missing from TC once again. Shall we keep them in
> that state for further investigation, or shall we re-image them once again?

:rmutter, please re-image them once again. If this repeats, we can review the logs to see what has happened to cause them to stop taking jobs.
Flags: needinfo?(dhouse)
Adding some new nodes seeing this problem from #ci:
> 21:32:51 <&riman|ciduty> Hello dhouse: The following  t-linux64-ms-(351, 356, 357, 436) workers are missing from Taskcluster. I have tried to re-image them but they all remain stuck after F12. Could you take a look, please?

t-linux64-ms-351
t-linux64-ms-356
t-linux64-ms-357
t-linux64-ms-436
Assignee: relops → dhouse
(In reply to Dave House [:dhouse] from comment #18)
> Adding some new nodes seeing this problem from #ci:
> > 21:32:51 <&riman|ciduty> Hello dhouse: The following  t-linux64-ms-(351, 356, 357, 436) workers are missing from Taskcluster. I have tried to re-image them but they all remain stuck after F12. Could you take a look, please?
> 
> t-linux64-ms-351
> t-linux64-ms-356
> t-linux64-ms-357
> t-linux64-ms-436

I see the same hang at "Booting PXE over IPv4" in mdc2 chassis 8, 9, and 11 (10, 12, 13, and 14 are not having this problem). I spot-checked across all of these mdc2 chassis.

Also, the 4 above are never pingable (and while spot-checking I found t-linux64-ms-346 had this problem 2 of 3 times I rebooted it, so I think this may be intermittent across others too). When I boot them from their local ubuntu install, they go into the "raise the network interfaces" waiting period and then give up without network (and are not pingable).

So I tried changing back to the VM admin hosts for PXE boot, and the failing machines still did not get any farther in PXE boot (so I reverted back to the correct new admin hosts for pxe/tftp).
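A quick reachability sweep like the sketch below makes this spot-checking less tedious (the mdc2 DNS suffix is inferred from the mdc1 hostnames elsewhere in this bug):

```
# Ping each suspect node once; -W2 gives a 2-second timeout per host.
for n in 351 356 357 436; do
  host="t-linux64-ms-${n}.test.releng.mdc2.mozilla.com"
  if ping -c1 -W2 "$host" >/dev/null 2>&1; then
    echo "$host: up"
  else
    echo "$host: DOWN"
  fi
done
```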
Through troubleshooting in #systems, Van found that the problem chassis needed their second switches restarted; see https://mana.mozilla.org/wiki/display/NETOPS/HP+Switch+Configuration#HPSwitchConfiguration-12.Troubleshooting

```
If you see the switches/chassis complaining of a duplicate IP, that means the switch may have lost its IRF config and will need to be rebooted.
ex: Duplicate address 10.51.16.34 on interface M-GigabitEthernet0/0/0, sourced from 9cb6-54fe-7cca
```

He fixed the moon chassis 8, 9, and 11, and I confirmed by PXE-booting two machines from each chassis.

I need to check through all of the Linux cartridges on these three chassis to make sure none are left thinking that they have no network (or needing to be reimaged).
We also need to set up some sort of monitoring so we are alerted if the switch problem happens again (since we do not know what caused it).
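As a stopgap until proper monitoring exists, something as small as the sketch below could run from cron against the central syslog (the log path, and the switch messages actually landing in syslog, are assumptions):

```
#!/bin/bash
# Alert if any chassis switch has logged the duplicate-address symptom.
if grep -q 'Duplicate address .* on interface M-GigabitEthernet' /var/log/syslog; then
  echo "CRITICAL: duplicate-address messages seen; a chassis switch may have lost its IRF config"
  exit 2
fi
echo "OK: no duplicate-address messages"
exit 0
```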
Went through a full check of which Linux moonshots appear in TC. It seems the following machines are not in TC:
t-linux64-ms-141
t-linux64-ms-351
t-linux64-ms-356
t-linux64-ms-357
t-linux64-ms-436

Will proceed with a reboot of every machine. If that doesn't work, I'll start a reimage for each one. I'll be back with updates.
(In reply to Roland Mutter Michael (:rmutter) from comment #22)
> Went through a full check of which Linux moonshots appear in TC. It seems
> the following machines are not in TC:
> t-linux64-ms-141
> t-linux64-ms-351
> t-linux64-ms-356
> t-linux64-ms-357
> t-linux64-ms-436
> 
> Will proceed with a reboot of every machine. If that doesn't work, I'll
> start a reimage for each one. I'll be back with updates.

Thank you! I appreciate your work on these.
After rebooting the machines, the following are the candidates for reimage:
t-linux64-ms-351
t-linux64-ms-356
t-linux64-ms-357
t-linux64-ms-436
:dhouse I saw from previous shifts that reimaging for 356, 357 and 436 is disabled. Please ping us whenever they are ready for the reimage. For now, Adrian will reimage t-linux64-ms-351.
(In reply to Roland Mutter Michael (:rmutter) from comment #25)
> :dhouse I saw from previous shifts that reimaging for 356, 357 and 436 is
> disabled. Please ping us whenever they are ready for the reimage. For now,
> Adrian will reimage t-linux64-ms-351.

Thank you. We were able to get the reimaging fixed (the network switches in moon 8/9/11 had lost some config and had to be reconfigured).

I'll reimage 356, 357, and 436 to make sure that works on them.
I've reimaged t-linux64-ms-351; it will need to be checked later to confirm it takes jobs.
(In reply to Adrian Pop from comment #29)
> I've reimaged t-linux64-ms-351; it will need to be checked later to confirm it takes jobs.

Looks good in TC: https://tools.taskcluster.net/groups/W1Mde5F9Rpm8QrPEqMl2Hg/tasks/PEvc-MzET6KLNaGyaclDWg/runs/0
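Rather than clicking through the TC UI, the queue API (same endpoint format as the worker links later in this bug) can confirm whether a worker is registered. A minimal sketch:

```
# Returns worker JSON if the queue knows the worker, or an error body if it does not.
curl -s "https://queue.taskcluster.net/v1/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/workers/mdc2/t-linux64-ms-351"
```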
All of the machines reported in this bug are accounted for and working correctly now (279 and 280 are loaners, all others were in a good state):

https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/workers/mdc1/t-linux64-ms-007
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/workers/mdc1/t-linux64-ms-057
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/workers/mdc1/t-linux64-ms-141
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/workers/mdc1/t-linux64-ms-183
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/workers/mdc1/t-linux64-ms-189
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/workers/mdc1/t-linux64-ms-193
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/workers/mdc1/t-linux64-ms-272
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/workers/mdc1/t-linux64-ms-273
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/workers/mdc1/t-linux64-ms-274
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/workers/mdc1/t-linux64-ms-275
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/workers/mdc1/t-linux64-ms-276
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/workers/mdc1/t-linux64-ms-277
Worker not found: https://queue.taskcluster.net/v1/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/workers/mdc1/t-linux64-ms-279
Worker not found: https://queue.taskcluster.net/v1/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/workers/mdc1/t-linux64-ms-280
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/workers/mdc2/t-linux64-ms-351
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/workers/mdc2/t-linux64-ms-356
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/workers/mdc2/t-linux64-ms-357
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/workers/mdc2/t-linux64-ms-394
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/workers/mdc2/t-linux64-ms-436
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/workers/mdc2/t-linux64-ms-484
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/workers/mdc2/t-linux64-ms-493
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/workers/mdc2/t-linux64-ms-495
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/workers/mdc2/t-linux64-ms-527
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/workers/mdc2/t-linux64-ms-580

The two missing from taskcluster are 279 and 280:
t-linux64-ms-279.test.releng.mdc1.mozilla.com https://moon-chassis-7.inband.releng.mdc1.mozilla.com/#/node/show/overview/r/rest/v1/Systems/c9n1
t-linux64-ms-280.test.releng.mdc1.mozilla.com https://moon-chassis-7.inband.releng.mdc1.mozilla.com/#/node/show/overview/r/rest/v1/Systems/c10n1

279 was repaired by re-seating the cartridge in bug 1435020 (I created bug 1472727 this morning to track it as a loaner).
280 is a loaner for Dragos (see bug 1464070).
No longer blocks: t-linux64-ms-280
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Depends on: 1473589
I'm going to keep tracking and updating this bug with Linux machines that fail.
t-linux64-ms-527 <-- rebooted, reimaged, back in TC, waiting for jobs.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
527 looks good: https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-linux-talos/workers/mdc2/t-linux64-ms-527
580 is a dev machine.
Status: REOPENED → RESOLVED
Closed: 6 years ago6 years ago
Resolution: --- → FIXED
I've re-imaged a large number of Linux moonshot machines, all of which apparently failed within 12 hours:

linux-{001, 003, 005, 008, 011, 014, 015, 047, 049, 051, 054, 056, 058, 092, 094, 097, 098, 100, 101, 103, 104, 136, 137, 140, 141, 144, 146, 148, 149, 181, 183, 185, 187, 188, 192, 193, 195, 226, 227, 232, 235, 237, 239, 271, 272, 273, 275, 276, 277, 279, 346, 353, 538}

Dave, could this be related to the firmware upgrade you brought to the moonshots? I also had considerably more W10 workers to deal with today, compared to the last 2 weeks.
Status: RESOLVED → REOPENED
Flags: needinfo?(dhouse)
Resolution: FIXED → ---
(In reply to Zsolt Fay [:zsoltfay] from comment #34)
> I've re-imaged a large number of Linux moonshot machines, all of which
> apparently failed within 12 hours:
> 
> linux-{001, 003, 005, 008, 011, 014, 015, 047, 049, 051, 054, 056, 058, 092,
> 094, 097, 098, 100, 101, 103, 104, 136, 137, 140, 141, 144, 146, 148, 149,
> 181, 183, 185, 187, 188, 192, 193, 195, 226, 227, 232, 235, 237, 239, 271,
> 272, 273, 275, 276, 277, 279, 346, 353, 538}
> 
> Dave, could this be related to the firmware upgrade you brought to the
> moonshots? I also had considerably more W10 workers to deal with today,
> compared to the last 2 weeks.

In the last 48h, 79 Linux moonshots have been re-imaged but have not recovered from that state.
At first I was afraid that the deployment process was bad, since some of the machines have been re-imaged at least once, most of them 2-3 times (tracking for that can be found in the following doc: https://docs.google.com/spreadsheets/d/1A6fU2t3rVY2oAd-U26k4lPZGjfULnh6w5XySqsMofUM), and they still appear in a bad state.
At this point, though, the deploy process looks to be normal, with no obvious reason why they fail to take tasks.
I'll continue investigating the issue and report back if I find something obvious.
Later update: after looking into services, the first 3 machines I checked have this process (PID 719) running:

[root@t-linux64-ms-005 ~]# ps -ef | grep puppet
root       719   714  0 03:09 ?        00:00:00 /bin/bash /root/puppetize.sh
root      5428  5416  0 17:22 pts/0    00:00:00 grep --color=auto puppet
[root@t-linux64-ms-005 ~]#

Doing 
> cat last_run_report.yaml |grep fail
we got
> status: failed

Also:
[root@t-linux64-ms-005 state]# cat /var/lib/puppet/state/last_run_summary.yaml
---
  version:
    config: remotes/origin/HEAD
    puppet: "3.8.5"
  resources:
    changed: 4
    failed: 1
    failed_to_restart: 0
    out_of_sync: 5
    restarted: 0
    scheduled: 0
    skipped: 1
    total: 456
  time:
    anchor: 0.004462988
    augeas: 0.291225276
    config_retrieval: 14.035952311998699
    exec: 0.374154974
    file: 1.0897829879999996
    filebucket: 8.704e-05
    firewall: 0.009833629000000002
    firewallchain: 0.001217386
    group: 0.000156733
    host: 0.000386595
    package: 4.367681283000001
    resources: 0.000127482
    schedule: 0.00052634
    service: 0.8251894579999998
    sysctl: 0.000188186
    total: 21.0020116199987
    user: 0.0010389499999999999
    last_run: 1537144918
  changes:
    total: 4
  events:
    failure: 1
    success: 4
    total: 5
[root@t-linux64-ms-005 state]#
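Given how many nodes are in this state, a sweep like the sketch below saves logging into each one by hand (it assumes root ssh from an admin host and the default Puppet 3 paths shown above; the node list is illustrative):

```
# Report the puppet "failure" event count per node, or flag the node as unreachable.
for n in 001 003 005 008 011; do
  host="t-linux64-ms-${n}.test.releng.mdc1.mozilla.com"
  echo -n "$host: "
  ssh -o ConnectTimeout=5 "root@$host" \
    "grep 'failure:' /var/lib/puppet/state/last_run_summary.yaml" 2>/dev/null \
    || echo "no summary / unreachable"
done
```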


and from papertrail we got this

>  message: "change from stopped to running failed: Could not start Service[mig-agent]: Execution of '/bin/systemctl start mig-agent' returned 5: Failed to start mig-agent.service: Unit mig-agent.service not found."
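A quick way to confirm on a node whether that unit is simply missing (a minimal sketch):

```
# Is the unit file installed at all?
systemctl list-unit-files | grep -i mig-agent || echo "mig-agent.service not installed"
# If it exists, what state is it in?
systemctl status mig-agent.service --no-pager
```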

and puppetize.log contains a lot of:
> Running puppet agent against server 'puppet'
> Puppet run failed; re-trying after 10m

I also started the puppet service on the first machine (t-linux64-ms-001), looked in papertrail, and found this:
> Sep 16 18:04:59 t-linux64-ms-001.test.releng.mdc1.mozilla.com puppet-agent: (/File[/var/lib/puppet/lib]) Could not evaluate: Could not retrieve file metadata for puppet://releng-puppet2.srv.releng.scl3.mozilla.com/plugins: Failed to open TCP connection to releng-puppet2.srv.releng.scl3.mozilla.com:8140 (Connection timed out - connect(2) for "releng-puppet2.srv.releng.scl3.mozilla.com" port 8140) 

Is that server being shut down?
bcrisan@bcrisan-P6198:~$ fping releng-puppet2.srv.releng.scl3.mozilla.com
releng-puppet2.srv.releng.scl3.mozilla.com is unreachable
(In reply to Bogdan Crisan [:bcrisan] (UTC +3, EEST) from comment #36)

> Is that server being shut down?
> bcrisan@bcrisan-P6198:~$ fping releng-puppet2.srv.releng.scl3.mozilla.com
> releng-puppet2.srv.releng.scl3.mozilla.com is unreachable

I'm going to answer that: yes, it is down, and the mdc1 puppet server should probably be used instead, because the workers are in MDC1.
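To confirm that theory on one failing node, the agent can be pointed at a reachable master for a single no-op run. A hedged sketch (the mdc1 master hostname below is a guess; substitute the real one):

```
# One-shot, no-change run against a hopefully-reachable master.
puppet agent --test --noop --server releng-puppet2.srv.releng.mdc1.mozilla.com
```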
I've tried to reboot the following; all of them were powered off. After powering them on, the machines restarted a few times without successfully booting the OS. After a few restarts, all of them went back to the powered-off state:

t-linux64-ms-272
t-linux64-ms-273
t-linux64-ms-276
t-linux64-ms-277
(In reply to Dave House [:dhouse] from comment #39)
> We have the moonshots configured to power off after 3 failed boots. 

If we're going to stick with that, then we should be getting alerts when that happens, either from nagios or iLO.
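A nagios-style power check could look like the sketch below; it assumes the cartridges answer standard IPMI (host and credentials are placeholders), otherwise the same query could go through the chassis iLO instead.

```
#!/bin/bash
# Map the node's power state to nagios exit codes.
STATUS=$(ipmitool -I lanplus -H <cartridge-bmc> -U <user> -P <pass> chassis power status)
case "$STATUS" in
  *"is on")  echo "OK: $STATUS";       exit 0 ;;
  *"is off") echo "CRITICAL: $STATUS"; exit 2 ;;
  *)         echo "UNKNOWN: $STATUS";  exit 3 ;;
esac
```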