Closed Bug 1141416 Opened 9 years ago Closed 9 years ago

Fix the slaves broken by talos's inability to deploy an update

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

Hardware: x86_64
OS: Linux
Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: philor, Unassigned)

References

Details

Attachments

(5 files)

Not sure what an actual successful fix will look like. As I recall, the last time we added a new talos chunk and tried to fix the problem by reimaging the slaves that didn't pick up the new version of talos, we wound up with slaves running an even more broken version of talos.
Depends on: 1112773
Looks like bug 1112773 just needs to land. Poked that bug.
1112773 is resolved. I suspect this is now fixed as we *should* be using the cloned checkout of talos and not the talos module baked into the venv.
(In reply to Jordan Lund (:jlund) from comment #2)
> 1112773 is resolved. I suspect this is now fixed as we *should* be using the
> cloned checkout of talos and not the talos module baked into the venv.

I will go through the list of Linux slaves in this bug tomorrow and re-image them all. Whee!
I've re-imaged talos-linux32-ix-00[1,3,8]. 003 failed the first job it took:

http://buildbot-master104.bb.releng.scl3.mozilla.com:8201/builders/Ubuntu%20HW%2012.04%20try%20talos%20g2/builds/392

I'm holding off re-imaging the rest until we have at least a few successful runs on this first batch.
(In reply to Chris Cooper [:coop] from comment #4)
> I've re-imaged talos-linux32-ix-00[1,3,8]. 003 failed the first job it took:

All of these are timing out after an hour without output while trying to run talos. :/
(In reply to Chris Cooper [:coop] from comment #5)
> (In reply to Chris Cooper [:coop] from comment #4)
> > I've re-imaged talos-linux32-ix-00[1,3,8]. 003 failed the first job it took:
> 
> All of these are timing out after an hour without output while trying to run
> talos. :/

for those with access, here are the three failed jobs:
https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?name=talos-linux32-ix-003

here is what I think is happening:

1) The first job was a try job that used a pinned mozharness (mh) rev (321d9dcec7b2) that predates my fix for this bug[1]. That means we populated the python venv with a talos installation, so we fail like before because the job uses the venv rather than the talos repo:

11:50:00     INFO - Calling ['/builds/slave/test/build/venv/bin/talos', '--noisy', # ... etc

2) Then, even though the next job used the current m-i mh pin with the required fix, we end up with a corrupt venv that still carries stale talos data. Essentially, we call PerfConfigurator.py/run_tests.py directly from the 'bad' python interpreter:

13:42:36     INFO - Calling ['/builds/slave/test/build/venv/bin/python', '/builds/slave/test/build/talos_repo/talos/PerfConfigurator.py', # etc

In that case, we would have to either clobber the venv before each job or wait until it is unlikely that anyone will push to try with a really old mh pin.
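
For illustration, here is a quick way to check which talos the venv's python actually resolves. This is a minimal, hypothetical sketch; only the paths come from the logs above, and the check is not part of mozharness:

import subprocess

VENV_PYTHON = "/builds/slave/test/build/venv/bin/python"  # path from the logs above

# Print the file the 'talos' module resolves to. If it lives under .../venv/...
# rather than .../talos_repo/..., the stale venv package is shadowing the clone.
output = subprocess.check_output(
    [VENV_PYTHON, "-c", "import talos; print(talos.__file__)"],
    universal_newlines=True,
)
print("venv python imports talos from: %s" % output.strip())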

jmaher, does that sound about right to you?

[1] http://hg.mozilla.org/build/mozharness/rev/f4520ff7c234
Flags: needinfo?(jmaher)
this sounds very plausible to me.  Should we wait ~1 week and then give it a go?
Flags: needinfo?(jmaher)
(In reply to Joel Maher (:jmaher) from comment #7)
> this sounds very plausible to me.  Should we wait ~1 week and then give it a
> go?

Does it affect the calculus here that some of the failures are happening on mozilla-inbound as well?

e.g. http://buildbot-master104.bb.releng.scl3.mozilla.com:8201/builders/Ubuntu%20HW%2012.04%20mozilla-inbound%20talos%20g1/builds/1436
I don't know how to view that link above or get more information.

Do we still have talos slaves broken on recent inbound/fx-team/mozilla-central builds?

If so, then we need to take one of those failing slaves out and investigate it in more detail.
(In reply to Joel Maher (:jmaher) from comment #9)
> I don't know how to view that link above or get more information.
> 
> Do we still have talos slaves broken on recent
> inbound/fx-team/mozilla-central builds?
> 
> If so, then we need to take one of those failing slaves out and investigate
> it in more detail.

Hrm, so the m-i job came after the old try job; in comment 6 I was suggesting that they shared the same venv (/builds/slave/test/build/venv), which still carried talos packages from the try job.
Interesting - why isn't the venv updated?
(In reply to Joel Maher (:jmaher) from comment #11)
> interesting- why isn't the venv updated?

I think it is updated with the new modules we added, but what I'm suggesting is that it will still have the 'talos' virtualenv module in it, and I suspect that when we call the 'cloned talos checkout' scripts from the venv python, we end up using bits from the 'venv talos' package. That was my original guess, granted I'm not familiar with talos and its setup.

If we don't have any better ideas, I think it is worth trying this again at the end of this week on one freshly imaged machine, in the hope that no try runs are still using an old mh rev.

I can't track down a public link for the first job anymore (the try job based on the old mh rev).

Here is the second job (the one based on the new mh rev that uses the cloned talos): https://treeherder.mozilla.org/logviewer.html#?job_id=11105391&repo=mozilla-inbound
We tried it with talos-linux64-ix-002 this morning, freshly reimaged after a loan, first job it took was on mozilla-inbound, http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-inbound-linux64/1435683703/mozilla-inbound_ubuntu64_hw_test-tp5o-bm105-tests1-linux-build1089.txt.gz
(In reply to Phil Ringnalda (:philor) from comment #13)
> We tried it with talos-linux64-ix-002 this morning, freshly reimaged after a
> loan, first job it took was on mozilla-inbound,
> http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-
> inbound-linux64/1435683703/mozilla-inbound_ubuntu64_hw_test-tp5o-bm105-
> tests1-linux-build1089.txt.gz

Hrm, and it is using the talos repo, so it looks like something else is at play. Maybe the issue was never that talos wasn't updating...

10:31:54     INFO - Calling ['/builds/slave/test/build/venv/bin/python', '/builds/slave/test/build/talos_repo/talos/PerfConfigurator.py'

Joel, can we set you (or someone you recommend) up with a freshly imaged slave again to poke around?
Flags: needinfo?(jmaher)
There was a time when the issue was that talos wasn't updating: the ones from at least as recently as August 2014 were failing when they tried to run a newly added suite because, as far as the talos they were running was concerned, that suite did not exist. But it has been more than a year since we last put a Linux talos slave back in service, so practically anything could have rotted in the image in the meantime.
Please get me a loaner, and I will look at this
Flags: needinfo?(jmaher)
Depends on: 1181250
(In reply to Joel Maher (:jmaher) from comment #16)
> Please get me a loaner, and I will look at this

Grabbed talos-linux64-ix-002 for Joel. 

He may still need some help here to work through puppet issues if we can't find something with the harnesses.
I am not sure what to look for here. I am able to run tests successfully, even tests which are *brand new* and not defined in the production environment. With that said, looking at the logs, we seem to get PerfConfigurator.py and run_tests.py from the checkout, not the venv.

Maybe we have an old venv talos sitting around? We need to use the venv's python to access its modules.
I looked at the log that philor pasted. It seems that we clobber and use the right thing for the venv.

I get the odd feeling that we could see something through VNC.

I would not mind having a look at a loaner.

10:31:55     INFO -  DEBUG : command line: /builds/slave/test/build/application/firefox/firefox -profile /tmp/tmprFZCOQ/profile http://localhost/getInfo.html

command timed out: 3600 seconds without output running ['/tools/buildbot/bin/python', 'scripts/scripts/talos_script.py', '--suite', 'tp5o', '--add-option', '--webServer,localhost', '--branch-name', 'Mozilla-Inbound-Non-PGO', '--system-bits', '64', '--cfg', 'talos/linux_config.py', '--download-symbols', 'ondemand', '--use-talos-json', '--blob-upload-branch', 'Mozilla-Inbound-Non-PGO'], attempting to kill
process killed by signal 9
program finished with exit code -1
elapsedTime=3751.827034
> I would not mind having a look at a loaner.
> 

Thanks Armen. I'm sure Joel wouldn't mind you using his slave (talos-linux64-ix-002, https://bugzil.la/1181250) if he is not currently using it. The passwords are likely the same as on your last loaner. I double-checked and you are still on the VPN loaner list.
> I would not mind having a look at a loaner.

Hey Armen, have you had a chance to look at this loaner?
Flags: needinfo?(armenzg)
Not yet. I will leave the NI in place for next week.
jlund, could you please have a look at a working machine and check the permissions on these files?

[cltbld@talos-linux64-ix-002 test]$ ls -l /usr/lib/mozilla/plugins/
total 364
-rw-r--r-- 1 root root   6048 Jul  9  2012 librhythmbox-itms-detection-plugin.so
-rw-r--r-- 1 root root 100720 Apr  2  2012 libtotem-cone-plugin.so
-rw-r--r-- 1 root root 105424 Apr  2  2012 libtotem-gmp-plugin.so
-rw-r--r-- 1 root root  72040 Apr  2  2012 libtotem-mully-plugin.so
-rw-r--r-- 1 root root  80560 Apr  2  2012 libtotem-narrowspace-plugin.so


I see this in the output of a talos run:
12:20:55     INFO -  DEBUG : command line: /builds/slave/test/build/application/firefox/firefox -profile /tmp/tmpQZ2ppq/profile http://localhost/getInfo.html
12:20:56     INFO -  LoadPlugin: failed to initialize shared library libXt.so [libXt.so: cannot open shared object file: No such file or directory]
12:20:56     INFO -  LoadPlugin: failed to initialize shared library libXext.so [libXext.so: cannot open shared object file: No such file or directory]
12:20:56     INFO -  LoadPlugin: failed to initialize shared library /usr/lib/mozilla/plugins/librhythmbox-itms-detection-plugin.so [/usr/lib/mozilla/plugins/librhythmbox-itms-detection-plugin.so: wrong ELF class: ELFCLASS64]
12:20:56     INFO -  LoadPlugin: failed to initialize shared library /usr/lib/mozilla/plugins/libtotem-cone-plugin.so [/usr/lib/mozilla/plugins/libtotem-cone-plugin.so: wrong ELF class: ELFCLASS64]
12:20:56     INFO -  LoadPlugin: failed to initialize shared library /usr/lib/mozilla/plugins/libtotem-mully-plugin.so [/usr/lib/mozilla/plugins/libtotem-mully-plugin.so: wrong ELF class: ELFCLASS64]
12:20:56     INFO -  LoadPlugin: failed to initialize shared library /usr/lib/mozilla/plugins/libtotem-gmp-plugin.so [/usr/lib/mozilla/plugins/libtotem-gmp-plugin.so: wrong ELF class: ELFCLASS64]
12:20:56     INFO -  LoadPlugin: failed to initialize shared library /usr/lib/mozilla/plugins/libtotem-narrowspace-plugin.so [/usr/lib/mozilla/plugins/libtotem-narrowspace-plugin.so: wrong ELF class: ELFCLASS64]
12:20:59     INFO -  __metrics	Screen width/height:1600/1200
12:20:59     INFO -  	colorDepth:24
12:20:59     INFO -  	Browser inner width/height: 1024/697
12:20:59     INFO -  __metrics
12:20:59     INFO -  JavaScript error: resource:///modules/WebappManager.jsm, line 48: NS_ERROR_FAILURE: Component returned failure code: 0x80004005 (NS_ERROR_FAILURE) [nsIObserverService.removeObserver]
12:21:01     INFO -  DEBUG : initialized firefox
12:21:01     INFO -  DEBUG : command line: /builds/slave/test/build/application/firefox/firefox -profile /tmp/tmpQZ2ppq/profile -tp file:/builds/slave/test/build/talos_repo/talos/page_load_test/canvasmark/canvasmark.manifest -tpchrome -tpnoisy -tpcycles 5 -tppagecycles 1
Flags: needinfo?(armenzg)
(In reply to Armen Zambrano Gasparnian [:armenzg] from comment #23)
> jlund, could you please have a look at a working machine and check for the
> permission of these files?
> 
> [cltbld@talos-linux64-ix-002 test]$ ls -l /usr/lib/mozilla/plugins/
> total 364
> -rw-r--r-- 1 root root   6048 Jul  9  2012
> librhythmbox-itms-detection-plugin.so
> -rw-r--r-- 1 root root 100720 Apr  2  2012 libtotem-cone-plugin.so
> -rw-r--r-- 1 root root 105424 Apr  2  2012 libtotem-gmp-plugin.so
> -rw-r--r-- 1 root root  72040 Apr  2  2012 libtotem-mully-plugin.so
> -rw-r--r-- 1 root root  80560 Apr  2  2012 libtotem-narrowspace-plugin.so

On talos-linux64-ix-085, I see the same thing:

[cltbld@talos-linux64-ix-085 ~]$ ls -l /usr/lib/mozilla/plugins/
total 364
-rw-r--r-- 1 root root   6048 Jul  9  2012 librhythmbox-itms-detection-plugin.so
-rw-r--r-- 1 root root 100720 Apr  2  2012 libtotem-cone-plugin.so
-rw-r--r-- 1 root root 105424 Apr  2  2012 libtotem-gmp-plugin.so
-rw-r--r-- 1 root root  72040 Apr  2  2012 libtotem-mully-plugin.so
-rw-r--r-- 1 root root  80560 Apr  2  2012 libtotem-narrowspace-plugin.so
Attached image screenshot of bad slave
I don't know if it is related. Just attaching the screenshot of one of the machines with issues.
The bar on the left is not there.
We still have a bad slave running (bad svg run):
http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/b2g-inbound-linux/1438718104/b2g-inbound_ubuntu32_hw_test-svgr-bm104-tests1-linux-build69.txt.gz

Here's a good svg run:
http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/b2g-inbound-linux/1438720319/b2g-inbound_ubuntu32_hw_test-svgr-bm105-tests1-linux-build226.txt.gz

From comparing the logs after stripping timestamps (:%s/\d\d:\d\d:\d\d//g), the only difference I see in the bad run is this:
warning: s3-us-west-2.amazonaws.com certificate with fingerprint d2::be:33:1e:fa:6b:e2:9f:51:e1:8a:ab::64:e7 not verified (check hostfingerprints or web.cacerts config setting)

Nothing to go on there.
Judging from /var/log/syslog, I believe runner was still running at about the time the job was running [1].

Some interesting symptoms:
* talos-linux32-003 was disabled on slavealloc yet was taking jobs in production
* At about 13:55:09 today it was running a buildbot job while runner was also running [1]
* I could not stay connected to that machine for a while; possibly runner kills the ssh server
* The desktop environment looks messed up (see attachment)
** Perhaps runner is killing it or restarting it

talos-linux64-ix-002, which jmaher was using, does not show these symptoms, but it might have been in such a state before it was loaned. Perhaps the loaner cleanup process fixes the underlying issue.

Could someone please investigate talos-linux32-003 and determine what is going on?
I would have tried rebooting the machine, but then I would not know whether the reboot was what cleared up the issue.

Do we have metrics for this host?

[1]
Aug  5 11:54:59 talos-linux32-ix-003 running post-task hook: /opt/runner/task_hook.py {"try_num": 1, "max_retries": 5, "task": "0-checkout_tools", "result": "OK"}
Aug  5 11:55:00 talos-linux32-ix-003 running pre-task hook: /opt/runner/task_hook.py {"try_num": 1, "max_retries": 5, "task": "0-config_hgrc", "result": "RUNNING"}
Aug  5 11:55:01 talos-linux32-ix-003 0-config_hgrc: starting (max time 600s)
Aug  5 11:55:02 talos-linux32-ix-003 0-config_hgrc: OK
Aug  5 11:55:02 talos-linux32-ix-003 running post-task hook: /opt/runner/task_hook.py {"try_num": 1, "max_retries": 5, "task": "0-config_hgrc", "result": "OK"}
Aug  5 11:55:03 talos-linux32-ix-003 running pre-task hook: /opt/runner/task_hook.py {"try_num": 1, "max_retries": 5, "task": "1-cleanslate", "result": "RUNNING"}
Aug  5 11:55:04 talos-linux32-ix-003 1-cleanslate: starting (max time 600s)
Aug  5 11:55:05 talos-linux32-ix-003 1-cleanslate: OK
Aug  5 11:55:05 talos-linux32-ix-003 running post-task hook: /opt/runner/task_hook.py {"try_num": 1, "max_retries": 5, "task": "1-cleanslate", "result": "OK"}
Aug  5 11:55:06 talos-linux32-ix-003 running pre-task hook: /opt/runner/task_hook.py {"try_num": 1, "max_retries": 5, "task": "1-cleanup", "result": "RUNNING"}
Aug  5 11:55:07 talos-linux32-ix-003 1-cleanup: starting (max time 600s)
Aug  5 11:55:08 talos-linux32-ix-003 1-cleanup: OK
Aug  5 11:55:08 talos-linux32-ix-003 running post-task hook: /opt/runner/task_hook.py {"try_num": 1, "max_retries": 5, "task": "1-cleanup", "result": "OK"}
Aug  5 11:55:09 talos-linux32-ix-003 running pre-task hook: /opt/runner/task_hook.py {"try_num": 1, "max_retries": 5, "task": "1-mig_agent", "result": "RUNNING"}
Aug  5 11:55:10 talos-linux32-ix-003 1-mig_agent: starting (max time 600s)
Aug  5 11:55:17 talos-linux32-ix-003 1-mig_agent: OK
(In reply to Armen Zambrano Gasparnian [:armenzg] from comment #27)
> * I could not stay connected to that machine for a while. Possible runner
> kills the ssh server
> * The desktop environment looks messed up (see attachment)
> ** Perhaps runner is killing it or restarting it
> 
> These symptoms are not faced by talos-linux64-ix-002 which jmaher was using
> but it might have been in such state before it was loaned. Perhaps the
> cleaning process of a loaner fixes the underlying issue.

This is expected. buildbot is started via a runner task after it has completed all the pre-flight tasks. The list of tasks is removed as part of the loan, so jmaher wouldn't have seen this.

> Could someone please investigate talos-linux32-003 and determine what is
> going on?
> I would have tried to reboot the machine but I would not know if I would be
> clearing up the issue.

One of the other tasks is cleanslate. cleanslate tries to build a list of allowed running processes and actively terminates things that are not in that list. I think this is likely where this is falling down, i.e. I suspect that cleanslate is killing off things like ssh connections, VNC sessions, possibly even the graphics context required for the tests themselves.
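
For readers unfamiliar with cleanslate, here is a minimal sketch of the whitelist-and-kill pattern described above. The whitelist and process names are made up for the example and are not cleanslate's actual configuration:

import os
import signal
import subprocess

ALLOWED = {"init", "sshd", "Xorg", "buildbot", "python"}  # hypothetical whitelist

def running_processes():
    # 'ps -eo pid,comm' lists every process as "<pid> <command name>"
    output = subprocess.check_output(["ps", "-eo", "pid,comm"],
                                     universal_newlines=True)
    for line in output.splitlines()[1:]:  # skip the header row
        pid, name = line.split(None, 1)
        yield int(pid), name.strip()

for pid, name in running_processes():
    if name not in ALLOWED and pid != os.getpid():
        try:
            os.kill(pid, signal.SIGTERM)  # anything not whitelisted gets terminated
        except OSError:
            pass  # process already gone, or not ours to kill

If the whitelist misses something like the ssh daemon or the X/window-manager processes, exactly the symptoms above (dropped connections, broken desktop) would follow.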

> Do we have metrics for this host?

We have logs in papertrail, but not sure you have access:

https://papertrailapp.com/systems/talos-linux32-ix-003/events
I discussed this yesterday with Armen and Rail. From the screenshot it seems that the window manager (Unity) is not running, so it's either not starting, crashing, or being killed. I'm going to try connecting a talos-linux64 machine to my tests master in staging today and running it with cleanslate disabled to see whether we end up with a functional window manager.
I have talos-linux64-ix-004 running against my staging tests master with cleanslate disabled, and it's actually running the test instead of timing out. That's a pretty clear indication to me that cleanslate is being too aggressive.

I'm going to dump the process list from talos-linux64-ix-004 running in this state and compare it against the process list of a machine running cleanslate to narrow down the processes we should be white-listing.
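
A quick way to do that comparison, sketched in Python (the dump file names are hypothetical; each file would be the saved output of `ps -eo comm` from one slave):

def proc_names(path):
    with open(path) as f:
        return set(line.strip() for line in f if line.strip())

with_cleanslate = proc_names("ps-with-cleanslate.txt")
without_cleanslate = proc_names("ps-without-cleanslate.txt")

# Processes that only survive when cleanslate is disabled are the
# candidates for white-listing.
print(sorted(without_cleanslate - with_cleanslate))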
...and now I have no idea what's going on. Every single talos test I've run against my tests server has passed, regardless of whether cleanslate has been disabled or not.

I'm going to try re-imaging talos-linux64-ix-004 to see whether I can recreate the standard bustage. I'm way too pessimistic to believe this has actually been fixed.
Does the window manager get fixed?
If you VNC into a machine after your changes, can you see Firefox/talos running?
After a fresh re-install, the machine (talos-linux64-ix-004) will hang in talos as shown in the attached screenshot. This indicates to me that apache is not starting correctly, or the machine is being marked as ready for service *before* apache is properly installed.

On a hunch, I tried simply rebooting the machine. It took the next job and had no trouble finding localhost/getInfo.html. This indicates to me that the fix likely belongs in puppet, i.e. verifying that apache is properly installed before the final reboot that returns the slave to production, or at the very worst simply adding another reboot to that cycle.
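
For concreteness, a minimal sketch of the kind of "is apache actually serving yet?" check being proposed. Only the URL comes from the logs in this bug; the script and where it would run in the reboot cycle are assumptions:

try:
    from urllib.request import urlopen      # Python 3
except ImportError:
    from urllib2 import urlopen             # Python 2

URL = "http://localhost/getInfo.html"        # the page talos requests at startup

def apache_ready(url=URL, timeout=10):
    try:
        return urlopen(url, timeout=timeout).getcode() == 200
    except Exception:
        return False

if not apache_ready():
    raise SystemExit("apache is not serving %s yet; hold this slave back" % URL)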

I'm going to try to replicate this behavior on another slave before I dive into puppet.
(In reply to Chris Cooper [:coop] from comment #33) 
> I'm going to try to replicate this behavior on another slave before I dive
> into puppet.

Was able to replicate on talos-linux64-ix-001 after a re-image, and verified that running "service apache2 restart" was enough to start serving content properly.

Now, to figure out how to invoke an apache restart via puppet...
(In reply to Chris Cooper [:coop] from comment #34)
> Now, to figure out how to invoke an apache restart via puppet...

AFAICT, this should already be happening. Here's the relevant log snippet from the initial puppet run on talos-linux64-ix-001:

Aug 10 19:57:43 talos-linux64-ix-001 puppet-agent[941]: (/Stage[main]/Packages::Httpd/Package[apache2]/ensure) ensure changed 'purged' to 'latest'
Aug 10 19:57:43 talos-linux64-ix-001 puppet-agent[941]: (/Stage[main]/Talos/Httpd::Config[talos.conf]/File[/etc/apache2/sites-enabled/talos.conf]/ensure) defined content as '{md5}be879b3a62da323700e7fa3badeb9acb'
Aug 10 19:57:45 talos-linux64-ix-001 puppet-agent[941]: (/Stage[main]/Httpd/Service[httpd]) Triggered 'refresh' from 1 events

Maybe the refresh action on Ubuntu isn't doing what we expect?
While I work on the puppet fix, I'm going to start re-imaging all of the affected slaves in the blocker list. It's sufficient to simply reboot each slave an extra time after re-imaging and prior to putting them back into service, and since we have concerns about the size of these pools (bug 1193025), we should get as many back into service as possible.
No longer blocks: talos-linux32-ix-022
No longer blocks: talos-linux32-ix-026
No longer blocks: talos-linux32-ix-008
The same fix was insufficient for the linux32 machines, so Ryan ended up disabling all the re-imaged linux32 slaves yesterday. Something is still wrong with the graphics on these machines: the tests won't start, and I can't connect via VNC like I can on the linux64 iX machines.

Rail: you set these up originally, and generally know more about linux. Can you have a look to see if it's something obvious?
Blocks: 1193025
Rail: see comment #37
Flags: needinfo?(rail)
(In reply to Chris Cooper [:coop] from comment #38)
> Rail: see comment #37

talos-linux32-ix-001 is currently attached to my staging master if you need a machine to poke at.

http://dev-master2.bb.releng.use1.mozilla.com:8045/buildslaves/talos-linux32-ix-001
I suspect that the following should fix the issue: https://gist.github.com/rail/a10d90a520181b44f88a
Flags: needinfo?(rail)
Attachment #8648104 - Flags: review?(bugspam.Callek)
Comment on attachment 8648104 [details] [diff] [review]
notify_httpd.diff

Review of attachment 8648104 [details] [diff] [review]:
-----------------------------------------------------------------

I'd be surprised if this fixes linux32 as described (given it had extra restarts and such), but it is the right thing to do anyway.
Attachment #8648104 - Flags: review?(bugspam.Callek) → review+
I think this is what happens here:

1) puppet installs apache and starts it (ensure => running)
2) puppet deletes the file above, but apache is already running and has all configs in memory
Found these:

[root@talos-linux32-ix-001 init]# tail  /var/log/upstart/x11.log                                                                                                                               
Current version of pixman: 0.28.2
        Before reporting problems, check http://wiki.x.org
        to make sure that you have the latest version.
Markers: (--) probed, (**) from config file, (==) default setting,
        (++) from command line, (!!) notice, (II) informational,
        (WW) warning, (EE) error, (NI) not implemented, (??) unknown.
(==) Log file: "/var/log/Xorg.0.log", Time: Fri Aug 14 09:53:22 2015
(==) Using config file: "/etc/X11/xorg.conf"
(==) Using system config directory "/usr/share/X11/xorg.conf.d"
FATAL: Module nvidia_310 not found.


[root@talos-linux32-ix-001 init]# modprobe nvidia_310
FATAL: Module nvidia_310 not found.

The command above works fine on talos-linux32-ix-043
[root@talos-linux32-ix-043 ~]# ls /lib/modules/`uname -r`/updates/dkms/nvidia_310.ko 
/lib/modules/3.2.0-76-generic-pae/updates/dkms/nvidia_310.ko


[root@talos-linux32-ix-001 init]# ls /lib/modules/`uname -r`/updates/dkms/nvidia_310.ko                                                                                                        
ls: cannot access /lib/modules/3.2.0-76-generic-pae/updates/dkms/nvidia_310.ko: No such file or directory
[root@talos-linux32-ix-043 ~]# dkms status nvidia-310
nvidia-310, 310.32, 3.2.0-76-generic-pae, i686: installed
nvidia-310, 310.32, 3.5.0-18-generic, i686: installed

[root@talos-linux32-ix-001 init]# dkms status nvidia-310
nvidia-310, 310.32, 3.5.0-18-generic, i686: installed

getting there :)
It looks like nvidia-310 needs to be installed after we downgrade from the latest kernel (3.5.0-18) to 3.2.0-76-generic-pae; otherwise the dkms package adds itself only to 3.5.0-18. Also, there is no need to keep 3.5.0-18 around.
Attachment #8648176 - Flags: review?(bugspam.Callek)
Comment on attachment 8648176 [details] [diff] [review]
nvidia_needs_kernel.diff

Sure, why not -- I trust :rail here
Attachment #8648176 - Flags: review?(bugspam.Callek) → review+
Attachment #8648176 - Flags: checked-in+
I'm re-imaging talos-linux32-ix-001 to take the fix.
(In reply to Chris Cooper (on PTO until Aug 31) [:coop] from comment #51)
> I'm re-imaging talos-linux32-ix-001 to take the fix.

No improvement, sadly.
I dug into the issue a little bit deeper yesterday. tl;dr: dkms adds kernel modules only for the current and latest kernels. In our case the system is bootstrapped with a 3.5.0-something kernel (coming from the xorg-edgers repo) and then we downgrade the kernel to 3.2.0-76 (or thereabouts). dkms ignores the final kernel because its version is lower than that of the running kernel, so dkms installs the nvidia modules only for the current/latest kernel (3.5.0).

As a temporary solution we can remove 3.5.0 from the repos, but the same issue may hit us again in the future.

It'd probably be better to create a helper script which, on boot, checks that all available dkms modules (per `dkms status`) are built/installed for the current kernel.

The following patch addresses the issues above and I tested it on talos-linux32-ix-001.

Some highlights:

* Remove all obsolete kernels regardless of suffix (we install -generic on 32-bit first, then switch to -generic-pae)

* The script is invoked by upstart before x11 starts (on "starting x11") to make sure we have the modules before we start X

/var/log/upstart/nvidia-310.log contains all console output generated by the script.
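
As an aside, here is a minimal Python sketch of the check that helper performs: parse `dkms status` and flag modules that are not installed for the kernel we are actually running. The parsing matches the `dkms status` output quoted in this bug; the script itself is illustrative only (the real fix is the upstart job in the attached patch):

import platform
import subprocess

def modules_missing_for_running_kernel():
    running = platform.release()  # same value as `uname -r`
    status = subprocess.check_output(["dkms", "status"], universal_newlines=True)
    all_modules, ok_modules = set(), set()
    for line in status.splitlines():
        # e.g. "nvidia-310, 310.32, 3.2.0-76-generic-pae, i686: installed"
        fields = [f.strip() for f in line.replace(":", ",").split(",")]
        if len(fields) < 5:
            continue
        module, _version, kernel, _arch, state = fields[:5]
        all_modules.add(module)
        if kernel == running and state == "installed":
            ok_modules.add(module)
    return all_modules - ok_modules

if __name__ == "__main__":
    missing = modules_missing_for_running_kernel()
    if missing:
        print("dkms modules not built for %s: %s"
              % (platform.release(), ", ".join(sorted(missing))))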
Attachment #8648400 - Flags: review?(bugspam.Callek)
Comment on attachment 8648400 [details] [diff] [review]
nvidia_rebuild_modules.diff

Review of attachment 8648400 [details] [diff] [review]:
-----------------------------------------------------------------

::: modules/packages/manifests/kernel.pp
@@ +66,5 @@
>                  if $kernelrelease == $kernel_ver and ! empty($obsolete_kernel_list) {
> +                    $obsolete_kernels  = suffix( prefix( $obsolete_kernel_list, 'linux-image-' ), '-generic')
> +                    $obsolete_headers  = suffix( prefix( $obsolete_kernel_list, 'linux-headers-' ), '-generic' )
> +                    $obsolete_kernels_pae  = suffix( prefix( $obsolete_kernel_list, 'linux-image-' ), '-generic-pae')
> +                    $obsolete_headers_pae  = suffix( prefix( $obsolete_kernel_list, 'linux-headers-' ), '-generic-pae' )

Looks like we no longer need http://hg.mozilla.org/build/puppet/annotate/27f57f829847/modules/packages/manifests/kernel.pp#l47

::: modules/packages/templates/nvidia_dkms.conf.erb
@@ +8,5 @@
> +        dkms add -m nvidia-<%= @nvidia_version %> -v <%= @nvidia_full_version %> -k `uname -r`
> +        /usr/lib/dkms/dkms_autoinstaller start || true
> +        modprobe nvidia-<%= @nvidia_version %> || true
> +    fi
> +end script

I don't understand any of these commands (except the "am I installed" one), but I trust your testing here
Attachment #8648400 - Flags: review?(bugspam.Callek) → review+
Comment on attachment 8648400 [details] [diff] [review]
nvidia_rebuild_modules.diff

(In reply to Justin Wood (:Callek) from comment #54)
> Looks like we no longer need
> http://hg.mozilla.org/build/puppet/annotate/27f57f829847/modules/packages/
> manifests/kernel.pp#l47

I think we do; we still use the PAE kernel on 32-bit Linux.

> I don't understand any of these commands, (except the "am I installed" one),
> but I trust your testing here

Hehe. :) Thanks.

remote:   https://hg.mozilla.org/build/puppet/rev/b24bf2e5c400
remote:   https://hg.mozilla.org/build/puppet/rev/cdef239d9df5
Attachment #8648400 - Flags: checked-in+
re-imaging talos-linux32-ix-001 now...
looking much better now:

[root@talos-linux32-ix-001 ~]# dkms status
nvidia-310, 310.32, 3.2.0-76-generic-pae, i686: installed
v4l2loopback, 0.6.1, 3.2.0-76-generic-pae, i686: installed
[root@talos-linux32-ix-001 ~]# dpkg -l |grep linux-image |grep ^ii
ii  linux-image-3.2.0-76-generic-pae       3.2.0-76.111                                                            Linux kernel image for version 3.2.0 on 32 bit x86 SMP
ii  linux-image-generic-pae                3.2.0.76.90                                                             Generic Linux kernel image

I forced some tests to see the greens!
I see 3 green results in a row. \o/ Assuming it's fixed now.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
 \o/  \ /
  |    |
 / \  /o\
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard