Closed Bug 1141416 Opened 9 years ago Closed 9 years ago

Fix the slaves broken by talos's inability to deploy an update

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

Hardware: x86_64
OS: Linux
Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: philor, Unassigned)

References

Details

Attachments

(5 files)

Not sure what an actual successful fix will look like. As I recall, the last time we added a new talos chunk and tried to fix the problem by reimaging the slaves that didn't pick up the new version of talos, we wound up with slaves running an even more broken version of talos.
Depends on: 1112773
Looks like bug 1112773 just needs to land. Poked that bug.
1112773 is resolved. I suspect this is now fixed as we *should* be using the cloned checkout of talos and not the talos module baked into the venv.
(In reply to Jordan Lund (:jlund) from comment #2)
> 1112773 is resolved. I suspect this is now fixed as we *should* be using the
> cloned checkout of talos and not the talos module baked into the venv.

I will go through the list of Linux slaves in this bug tomorrow and re-image them all. Whee!
I've re-imaged talos-linux32-ix-00[1,3,8]. 003 failed the first job it took:

http://buildbot-master104.bb.releng.scl3.mozilla.com:8201/builders/Ubuntu%20HW%2012.04%20try%20talos%20g2/builds/392

I'm holding off re-imaging the rest until we have at least a few successful runs on this first batch.
(In reply to Chris Cooper [:coop] from comment #4)
> I've re-imaged talos-linux32-ix-00[1,3,8]. 003 failed the first job it took:

All of these are timing out after an hour without output while trying to run talos. :/
(In reply to Chris Cooper [:coop] from comment #5)
> (In reply to Chris Cooper [:coop] from comment #4)
> > I've re-imaged talos-linux32-ix-00[1,3,8]. 003 failed the first job it took:
> 
> All of these are timing out after an hour without output while trying to run
> talos. :/

for those with access, here are the three failed jobs:
https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?name=talos-linux32-ix-003

here is what I think is happening:

1) The first job was a try job that used a pinned mozharness (mh) rev (321d9dcec7b2) that predates my fix for this bug[1]. That means we populated the python venv with a talos installation, so we fail like before because the job uses the venv rather than the talos repo:

11:50:00     INFO - Calling ['/builds/slave/test/build/venv/bin/talos', '--noisy', # ... etc

2) Then, even though the next job used the current m-i mh pin with the required fix, we end up with a corrupt venv that still carries stale talos data. Essentially, we call PerfConfigurator.py/run_tests.py directly from the 'bad' python interpreter:

13:42:36     INFO - Calling ['/builds/slave/test/build/venv/bin/python', '/builds/slave/test/build/talos_repo/talos/PerfConfigurator.py', # etc

In that case, we would have to either clobber the venv before each job or wait until it is unlikely that anyone will push to try with a really old mh pin.
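
For illustration, here is a quick way to check which talos the venv's python actually resolves. This is a minimal, hypothetical sketch; only the paths come from the logs above, and the check is not part of mozharness:

import subprocess

VENV_PYTHON = "/builds/slave/test/build/venv/bin/python"  # path from the logs above

# Print the file the 'talos' module resolves to. If it lives under .../venv/...
# rather than .../talos_repo/..., the stale venv package is shadowing the clone.
output = subprocess.check_output(
    [VENV_PYTHON, "-c", "import talos; print(talos.__file__)"],
    universal_newlines=True,
)
print("venv python imports talos from: %s" % output.strip())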

jmaher, does that sound about right to you?

[1] http://hg.mozilla.org/build/mozharness/rev/f4520ff7c234
Flags: needinfo?(jmaher)
this sounds very plausible to me.  Should we wait ~1 week and then give it a go?
Flags: needinfo?(jmaher)
(In reply to Joel Maher (:jmaher) from comment #7)
> this sounds very plausible to me.  Should we wait ~1 week and then give it a
> go?

Does it affect the calculus here that some of the failures are happening on mozilla-inbound as well?

e.g. http://buildbot-master104.bb.releng.scl3.mozilla.com:8201/builders/Ubuntu%20HW%2012.04%20mozilla-inbound%20talos%20g1/builds/1436
I don't know how to view that link above or get more information.

Do we still have talos slaves broken on recent inbound/fx-team/mozilla-central builds?

If so, then we need to take one of those failing slaves out and investigate it in more detail.
(In reply to Joel Maher (:jmaher) from comment #9)
> I don't know how to view that link above or get more information.
> 
> Do we still have talos slaves broken on recent
> inbound/fx-team/mozilla-central builds?
> 
> If so, then we need to take one of those failing slaves out and investigate
> it in more detail.

Hrm, so the m-i job came after the old try job; in comment 6 I was suggesting that they shared the same venv (/builds/slave/test/build/venv), which still carried talos packages from the try job.
Interesting - why isn't the venv updated?
(In reply to Joel Maher (:jmaher) from comment #11)
> interesting- why isn't the venv updated?

I think it is updated with the new modules we added, but what I'm suggesting is that it will still have the 'talos' virtualenv module in it, and I suspect that when we call the 'cloned talos checkout' scripts from the venv python, we end up using bits from the 'venv talos' package. That was my original guess, granted I'm not familiar with talos and its setup.

If we don't have any better ideas, I think it is worth trying this again at the end of this week on one freshly imaged machine, in the hope that no try runs are still using an old mh rev.

I can't track down a public link for the first job anymore (the try job based on the old mh rev).

Here is the second job (the one based on the new mh rev that uses the cloned talos): https://treeherder.mozilla.org/logviewer.html#?job_id=11105391&repo=mozilla-inbound
We tried it with talos-linux64-ix-002 this morning, freshly reimaged after a loan, first job it took was on mozilla-inbound, http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-inbound-linux64/1435683703/mozilla-inbound_ubuntu64_hw_test-tp5o-bm105-tests1-linux-build1089.txt.gz
(In reply to Phil Ringnalda (:philor) from comment #13)
> We tried it with talos-linux64-ix-002 this morning, freshly reimaged after a
> loan, first job it took was on mozilla-inbound,
> http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-
> inbound-linux64/1435683703/mozilla-inbound_ubuntu64_hw_test-tp5o-bm105-
> tests1-linux-build1089.txt.gz

Hrm, and it is using the talos repo, so it looks like something else is at play. Maybe the issue was never that talos wasn't updating...

10:31:54     INFO - Calling ['/builds/slave/test/build/venv/bin/python', '/builds/slave/test/build/talos_repo/talos/PerfConfigurator.py'

Joel, can we set you (or someone you recommend) up with a freshly imaged slave again to poke around?
Flags: needinfo?(jmaher)
There was a time when the issue was that talos wasn't updating: the ones from at least as recently as August 2014 were failing when they tried to run a newly added suite because, as far as the talos they were running was concerned, that suite did not exist. But it has been more than a year since we last put a Linux talos slave back in service, so practically anything could have rotted in the image in the meantime.
Please get me a loaner, and I will look at this
Flags: needinfo?(jmaher)
Depends on: 1181250
(In reply to Joel Maher (:jmaher) from comment #16)
> Please get me a loaner, and I will look at this

Grabbed talos-linux64-ix-002 for Joel. 

He may still need some help here to work through puppet issues if we can't find something with the harnesses.
I am not sure what to look for here. I am able to run tests successfully, even tests which are *brand new* and not defined in the production environment. With that said, looking at the logs, we seem to get PerfConfigurator.py and run_tests.py from the checkout, not the venv.

Maybe we have an old venv talos sitting around? We need to use the venv's python to access its modules.
I looked at the log that philor pasted. It seems that we clobber and use the right thing for the venv.

I get the odd feeling that we could see something through VNC.

I would not mind having a look at a loaner.

10:31:55     INFO -  DEBUG : command line: /builds/slave/test/build/application/firefox/firefox -profile /tmp/tmprFZCOQ/profile http://localhost/getInfo.html

command timed out: 3600 seconds without output running ['/tools/buildbot/bin/python', 'scripts/scripts/talos_script.py', '--suite', 'tp5o', '--add-option', '--webServer,localhost', '--branch-name', 'Mozilla-Inbound-Non-PGO', '--system-bits', '64', '--cfg', 'talos/linux_config.py', '--download-symbols', 'ondemand', '--use-talos-json', '--blob-upload-branch', 'Mozilla-Inbound-Non-PGO'], attempting to kill
process killed by signal 9
program finished with exit code -1
elapsedTime=3751.827034
> I would not mind having a look at a loaner.
> 

Thanks Armen. I'm sure Joel wouldn't mind you using his slave (talos-linux64-ix-002, https://bugzil.la/1181250) if he is not currently using it. The passwords are likely the same as on your last loaner. I double-checked and you are still on the VPN loaner list.
> I would not mind having a look at a loaner.

Hey Armen, have you had a chance to look at this loaner?
Flags: needinfo?(armenzg)
Not yet. I will leave the NI in place for next week.
jlund, could you please have a look at a working machine and check the permissions on these files?

[cltbld@talos-linux64-ix-002 test]$ ls -l /usr/lib/mozilla/plugins/
total 364
-rw-r--r-- 1 root root   6048 Jul  9  2012 librhythmbox-itms-detection-plugin.so
-rw-r--r-- 1 root root 100720 Apr  2  2012 libtotem-cone-plugin.so
-rw-r--r-- 1 root root 105424 Apr  2  2012 libtotem-gmp-plugin.so
-rw-r--r-- 1 root root  72040 Apr  2  2012 libtotem-mully-plugin.so
-rw-r--r-- 1 root root  80560 Apr  2  2012 libtotem-narrowspace-plugin.so


I see this in the output of a talos run:
12:20:55     INFO -  DEBUG : command line: /builds/slave/test/build/application/firefox/firefox -profile /tmp/tmpQZ2ppq/profile http://localhost/getInfo.html
12:20:56     INFO -  LoadPlugin: failed to initialize shared library libXt.so [libXt.so: cannot open shared object file: No such file or directory]
12:20:56     INFO -  LoadPlugin: failed to initialize shared library libXext.so [libXext.so: cannot open shared object file: No such file or directory]
12:20:56     INFO -  LoadPlugin: failed to initialize shared library /usr/lib/mozilla/plugins/librhythmbox-itms-detection-plugin.so [/usr/lib/mozilla/plugins/librhythmbox-itms-detection-plugin.so: wrong ELF class: ELFCLASS64]
12:20:56     INFO -  LoadPlugin: failed to initialize shared library /usr/lib/mozilla/plugins/libtotem-cone-plugin.so [/usr/lib/mozilla/plugins/libtotem-cone-plugin.so: wrong ELF class: ELFCLASS64]
12:20:56     INFO -  LoadPlugin: failed to initialize shared library /usr/lib/mozilla/plugins/libtotem-mully-plugin.so [/usr/lib/mozilla/plugins/libtotem-mully-plugin.so: wrong ELF class: ELFCLASS64]
12:20:56     INFO -  LoadPlugin: failed to initialize shared library /usr/lib/mozilla/plugins/libtotem-gmp-plugin.so [/usr/lib/mozilla/plugins/libtotem-gmp-plugin.so: wrong ELF class: ELFCLASS64]
12:20:56     INFO -  LoadPlugin: failed to initialize shared library /usr/lib/mozilla/plugins/libtotem-narrowspace-plugin.so [/usr/lib/mozilla/plugins/libtotem-narrowspace-plugin.so: wrong ELF class: ELFCLASS64]
12:20:59     INFO -  __metrics	Screen width/height:1600/1200
12:20:59     INFO -  	colorDepth:24
12:20:59     INFO -  	Browser inner width/height: 1024/697
12:20:59     INFO -  __metrics
12:20:59     INFO -  JavaScript error: resource:///modules/WebappManager.jsm, line 48: NS_ERROR_FAILURE: Component returned failure code: 0x80004005 (NS_ERROR_FAILURE) [nsIObserverService.removeObserver]
12:21:01     INFO -  DEBUG : initialized firefox
12:21:01     INFO -  DEBUG : command line: /builds/slave/test/build/application/firefox/firefox -profile /tmp/tmpQZ2ppq/profile -tp file:/builds/slave/test/build/talos_repo/talos/page_load_test/canvasmark/canvasmark.manifest -tpchrome -tpnoisy -tpcycles 5 -tppagecycles 1
Flags: needinfo?(armenzg)
(In reply to Armen Zambrano Gasparnian [:armenzg] from comment #23)
> jlund, could you please have a look at a working machine and check for the
> permission of these files?
> 
> [cltbld@talos-linux64-ix-002 test]$ ls -l /usr/lib/mozilla/plugins/
> total 364
> -rw-r--r-- 1 root root   6048 Jul  9  2012
> librhythmbox-itms-detection-plugin.so
> -rw-r--r-- 1 root root 100720 Apr  2  2012 libtotem-cone-plugin.so
> -rw-r--r-- 1 root root 105424 Apr  2  2012 libtotem-gmp-plugin.so
> -rw-r--r-- 1 root root  72040 Apr  2  2012 libtotem-mully-plugin.so
> -rw-r--r-- 1 root root  80560 Apr  2  2012 libtotem-narrowspace-plugin.so

On talos-linux64-ix-085, I see the same thing:

[cltbld@talos-linux64-ix-085 ~]$ ls -l /usr/lib/mozilla/plugins/
total 364
-rw-r--r-- 1 root root   6048 Jul  9  2012 librhythmbox-itms-detection-plugin.so
-rw-r--r-- 1 root root 100720 Apr  2  2012 libtotem-cone-plugin.so
-rw-r--r-- 1 root root 105424 Apr  2  2012 libtotem-gmp-plugin.so
-rw-r--r-- 1 root root  72040 Apr  2  2012 libtotem-mully-plugin.so
-rw-r--r-- 1 root root  80560 Apr  2  2012 libtotem-narrowspace-plugin.so
Attached image screenshot of bad slave
I don't know if it is related. Just attaching the screenshot of one of the machines with issues.
The bar on the left is not there.
We still have a bad slave running (bad svg run):
http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/b2g-inbound-linux/1438718104/b2g-inbound_ubuntu32_hw_test-svgr-bm104-tests1-linux-build69.txt.gz

Here's a good svg run:
http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/b2g-inbound-linux/1438720319/b2g-inbound_ubuntu32_hw_test-svgr-bm105-tests1-linux-build226.txt.gz

From comparing the logs after stripping timestamps (:%s/\d\d:\d\d:\d\d//g), the only difference I see in the bad run is this:
warning: s3-us-west-2.amazonaws.com certificate with fingerprint d2::be:33:1e:fa:6b:e2:9f:51:e1:8a:ab::64:e7 not verified (check hostfingerprints or web.cacerts config setting)

Nothing to go on there.
Judging from /var/log/syslog, I believe runner was still running at about the time the job was running [1].

Some interesting symptoms:
* talos-linux32-003 was disabled on slavealloc yet was taking jobs in production
* At about 13:55:09 today it was running a buildbot job while runner was also running [1]
* I could not stay connected to that machine for a while; possibly runner kills the ssh server
* The desktop environment looks messed up (see attachment)
** Perhaps runner is killing it or restarting it

talos-linux64-ix-002, which jmaher was using, does not show these symptoms, but it might have been in such a state before it was loaned. Perhaps the loaner cleanup process fixes the underlying issue.

Could someone please investigate talos-linux32-003 and determine what is going on?
I would have tried rebooting the machine, but then I would not know whether the reboot was what cleared up the issue.

Do we have metrics for this host?

[1]
Aug  5 11:54:59 talos-linux32-ix-003 running post-task hook: /opt/runner/task_hook.py {"try_num": 1, "max_retries": 5, "task": "0-checkout_tools", "result": "OK"}
Aug  5 11:55:00 talos-linux32-ix-003 running pre-task hook: /opt/runner/task_hook.py {"try_num": 1, "max_retries": 5, "task": "0-config_hgrc", "result": "RUNNING"}
Aug  5 11:55:01 talos-linux32-ix-003 0-config_hgrc: starting (max time 600s)
Aug  5 11:55:02 talos-linux32-ix-003 0-config_hgrc: OK
Aug  5 11:55:02 talos-linux32-ix-003 running post-task hook: /opt/runner/task_hook.py {"try_num": 1, "max_retries": 5, "task": "0-config_hgrc", "result": "OK"}
Aug  5 11:55:03 talos-linux32-ix-003 running pre-task hook: /opt/runner/task_hook.py {"try_num": 1, "max_retries": 5, "task": "1-cleanslate", "result": "RUNNING"}
Aug  5 11:55:04 talos-linux32-ix-003 1-cleanslate: starting (max time 600s)
Aug  5 11:55:05 talos-linux32-ix-003 1-cleanslate: OK
Aug  5 11:55:05 talos-linux32-ix-003 running post-task hook: /opt/runner/task_hook.py {"try_num": 1, "max_retries": 5, "task": "1-cleanslate", "result": "OK"}
Aug  5 11:55:06 talos-linux32-ix-003 running pre-task hook: /opt/runner/task_hook.py {"try_num": 1, "max_retries": 5, "task": "1-cleanup", "result": "RUNNING"}
Aug  5 11:55:07 talos-linux32-ix-003 1-cleanup: starting (max time 600s)
Aug  5 11:55:08 talos-linux32-ix-003 1-cleanup: OK
Aug  5 11:55:08 talos-linux32-ix-003 running post-task hook: /opt/runner/task_hook.py {"try_num": 1, "max_retries": 5, "task": "1-cleanup", "result": "OK"}
Aug  5 11:55:09 talos-linux32-ix-003 running pre-task hook: /opt/runner/task_hook.py {"try_num": 1, "max_retries": 5, "task": "1-mig_agent", "result": "RUNNING"}
Aug  5 11:55:10 talos-linux32-ix-003 1-mig_agent: starting (max time 600s)
Aug  5 11:55:17 talos-linux32-ix-003 1-mig_agent: OK
(In reply to Armen Zambrano Gasparnian [:armenzg] from comment #27)
> * I could not stay connected to that machine for a while. Possible runner
> kills the ssh server
> * The desktop environment looks messed up (see attachment)
> ** Perhaps runner is killing it or restarting it
> 
> These symptoms are not faced by talos-linux64-ix-002 which jmaher was using
> but it might have been in such state before it was loaned. Perhaps the
> cleaning process of a loaner fixes the underlying issue.

This is expected. buildbot is started via a runner task after it has completed all the pre-flight tasks. The list of tasks is removed as part of the loan, so jmaher wouldn't have seen this.

> Could someone please investigate talos-linux32-003 and determine what is
> going on?
> I would have tried to reboot the machine but I would not know if I would be
> clearing up the issue.

One of the other tasks is cleanslate. cleanslate tries to build a list of allowed running processes and actively terminates things that are not in that list. I think this is likely where this is falling down, i.e. I suspect that cleanslate is killing off things like ssh connections, VNC sessions, possibly even the graphics context required for the tests themselves.
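
For readers unfamiliar with cleanslate, here is a minimal sketch of the whitelist-and-kill pattern described above. The whitelist and process names are made up for the example and are not cleanslate's actual configuration:

import os
import signal
import subprocess

ALLOWED = {"init", "sshd", "Xorg", "buildbot", "python"}  # hypothetical whitelist

def running_processes():
    # 'ps -eo pid,comm' lists every process as "<pid> <command name>"
    output = subprocess.check_output(["ps", "-eo", "pid,comm"],
                                     universal_newlines=True)
    for line in output.splitlines()[1:]:  # skip the header row
        pid, name = line.split(None, 1)
        yield int(pid), name.strip()

for pid, name in running_processes():
    if name not in ALLOWED and pid != os.getpid():
        try:
            os.kill(pid, signal.SIGTERM)  # anything not whitelisted gets terminated
        except OSError:
            pass  # process already gone, or not ours to kill

If the whitelist misses something like the ssh daemon or the X/window-manager processes, exactly the symptoms above (dropped connections, broken desktop) would follow.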

> Do we have metrics for this host?

We have logs in papertrail, but not sure you have access:

https://papertrailapp.com/systems/talos-linux32-ix-003/events
I discussed this yesterday with Armen and Rail. From the screenshot it seems that the window manager (Unity) is not running, so it's either not starting, crashing, or being killed. I'm going to try connecting a talos-linux64 machine to my tests master in staging today and running it with cleanslate disabled to see whether we end up with a functional window manager.
I have talos-linux64-ix-004 running against my staging tests master with cleanslate disabled, and it's actually running the test instead of timing out. That's a pretty clear indication to me that cleanslate is being too aggressive.

I'm going to dump the process list from talos-linux64-ix-004 running in this state and compare it against the process list of a machine running cleanslate to narrow down the processes we should be white-listing.
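
A quick way to do that comparison, sketched in Python (the dump file names are hypothetical; each file would be the saved output of `ps -eo comm` from one slave):

def proc_names(path):
    with open(path) as f:
        return set(line.strip() for line in f if line.strip())

with_cleanslate = proc_names("ps-with-cleanslate.txt")
without_cleanslate = proc_names("ps-without-cleanslate.txt")

# Processes that only survive when cleanslate is disabled are the
# candidates for white-listing.
print(sorted(without_cleanslate - with_cleanslate))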
...and now I have no idea what's going on. Every single talos test I've run against my tests server has passed, regardless of whether cleanslate has been disabled or not.

I'm going to try re-imaging talos-linux64-ix-004 to see whether I can recreate the standard bustage. I'm way too pessimistic to believe this has actually been fixed.
Does the window manager get fixed?
If you VNC into a machine after your changes, can you see Firefox/talos running?
After a fresh re-install, the machine (talos-linux64-ix-004) will hang in talos as shown in the attached screenshot. This indicates to me that apache is not starting correctly, or the machine is being marked as ready for service *before* apache is properly installed.

On a hunch, I tried simply rebooting the machine. It took the next job and had no trouble finding localhost/getInfo.html. This indicates to me that the fix likely belongs in puppet, i.e. verifying that apache is properly installed before the final reboot that returns the slave to production, or at the very worst simply adding another reboot to that cycle.
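
For concreteness, a minimal sketch of the kind of "is apache actually serving yet?" check being proposed. Only the URL comes from the logs in this bug; the script and where it would run in the reboot cycle are assumptions:

try:
    from urllib.request import urlopen      # Python 3
except ImportError:
    from urllib2 import urlopen             # Python 2

URL = "http://localhost/getInfo.html"        # the page talos requests at startup

def apache_ready(url=URL, timeout=10):
    try:
        return urlopen(url, timeout=timeout).getcode() == 200
    except Exception:
        return False

if not apache_ready():
    raise SystemExit("apache is not serving %s yet; hold this slave back" % URL)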

I'm going to try to replicate this behavior on another slave before I dive into puppet.
(In reply to Chris Cooper [:coop] from comment #33) 
> I'm going to try to replicate this behavior on another slave before I dive
> into puppet.

Was able to replicate on talos-linux64-ix-001 after a re-image, and verified that running "service apache2 restart" was enough to start serving content properly.

Now, to figure out how to invoke an apache restart via puppet...
(In reply to Chris Cooper [:coop] from comment #34)
> Now, to figure out how to invoke an apache restart via puppet...

AFAICT, this should already be happening. Here's the relevant log snippet from the initial puppet run on talos-linux64-ix-001:

Aug 10 19:57:43 talos-linux64-ix-001 puppet-agent[941]: (/Stage[main]/Packages::Httpd/Package[apache2]/ensure) ensure changed 'purged' to 'latest'
Aug 10 19:57:43 talos-linux64-ix-001 puppet-agent[941]: (/Stage[main]/Talos/Httpd::Config[talos.conf]/File[/etc/apache2/sites-enabled/talos.conf]/ensure) defined content as '{md5}be879b3a62da323700e7fa3badeb9acb'
Aug 10 19:57:45 talos-linux64-ix-001 puppet-agent[941]: (/Stage[main]/Httpd/Service[httpd]) Triggered 'refresh' from 1 events

Maybe the refresh action on Ubuntu isn't doing what we expect?
While I work on the puppet fix, I'm going to start re-imaging all of the affected slaves in the blocker list. It's sufficient to simply reboot each slave an extra time after re-imaging and prior to putting them back into service, and since we have concerns about the size of these pools (bug 1193025), we should get as many back into service as possible.
No longer blocks: talos-linux32-ix-022
No longer blocks: talos-linux32-ix-026
No longer blocks: talos-linux32-ix-008
The same fix was insufficient for the linux32 machines, so Ryan ended up disabling all the re-imaged linux32 slaves yesterday. Something is still wrong with the graphics on these machines: the tests won't start, and I can't connect via VNC like I can on the linux64 iX machines.

Rail: you set these up originally, and generally know more about linux. Can you have a look to see if it's something obvious?
Blocks: 1193025
Rail: see comment #37
Flags: needinfo?(rail)
(In reply to Chris Cooper [:coop] from comment #38)
> Rail: see comment #37

talos-linux32-ix-001 is currently attached to my staging master if you need a machine to poke at.

http://dev-master2.bb.releng.use1.mozilla.com:8045/buildslaves/talos-linux32-ix-001
I suspect that the following should fix the issue: https://gist.github.com/rail/a10d90a520181b44f88a
Flags: needinfo?(rail)
Attachment #8648104 - Flags: review?(bugspam.Callek)
Comment on attachment 8648104 [details] [diff] [review]
notify_httpd.diff

Review of attachment 8648104 [details] [diff] [review]:
-----------------------------------------------------------------

I'd be surprised if this fixes linux32 as described (given it had extra restarts and such), but it is the right thing to do anyway.
Attachment #8648104 - Flags: review?(bugspam.Callek) → review+
I think this is what happens here:

1) puppet installs apache and starts it (ensure => running)
2) puppet deletes the file above, but apache is already running and has all configs in memory
Found these:

[root@talos-linux32-ix-001 init]# tail  /var/log/upstart/x11.log                                                                                                                               
Current version of pixman: 0.28.2
        Before reporting problems, check http://wiki.x.org
        to make sure that you have the latest version.
Markers: (--) probed, (**) from config file, (==) default setting,
        (++) from command line, (!!) notice, (II) informational,
        (WW) warning, (EE) error, (NI) not implemented, (??) unknown.
(==) Log file: "/var/log/Xorg.0.log", Time: Fri Aug 14 09:53:22 2015
(==) Using config file: "/etc/X11/xorg.conf"
(==) Using system config directory "/usr/share/X11/xorg.conf.d"
FATAL: Module nvidia_310 not found.


[root@talos-linux32-ix-001 init]# modprobe nvidia_310
FATAL: Module nvidia_310 not found.

The command above works fine on talos-linux32-ix-043
[root@talos-linux32-ix-043 ~]# ls /lib/modules/`uname -r`/updates/dkms/nvidia_310.ko 
/lib/modules/3.2.0-76-generic-pae/updates/dkms/nvidia_310.ko


[root@talos-linux32-ix-001 init]# ls /lib/modules/`uname -r`/updates/dkms/nvidia_310.ko                                                                                                        
ls: cannot access /lib/modules/3.2.0-76-generic-pae/updates/dkms/nvidia_310.ko: No such file or directory
[root@talos-linux32-ix-043 ~]# dkms status nvidia-310
nvidia-310, 310.32, 3.2.0-76-generic-pae, i686: installed
nvidia-310, 310.32, 3.5.0-18-generic, i686: installed

[root@talos-linux32-ix-001 init]# dkms status nvidia-310
nvidia-310, 310.32, 3.5.0-18-generic, i686: installed

getting there :)
It looks like nvidia-310 needs to be installed after we downgrade from the latest kernel (3.5.0-18) to 3.2.0-76-generic-pae; otherwise the dkms package adds itself only to 3.5.0-18. Also, there is no need to keep 3.5.0-18 around.
Attachment #8648176 - Flags: review?(bugspam.Callek)
Comment on attachment 8648176 [details] [diff] [review]
nvidia_needs_kernel.diff

Sure, why not -- I trust :rail here
Attachment #8648176 - Flags: review?(bugspam.Callek) → review+
Attachment #8648176 - Flags: checked-in+
I'm re-imaging talos-linux32-ix-001 to take the fix.
(In reply to Chris Cooper (on PTO until Aug 31) [:coop] from comment #51)
> I'm re-imaging talos-linux32-ix-001 to take the fix.

No improvement, sadly.
I dug into the issue a little bit deeper yesterday. tl;dr: dkms adds kernel modules only for the current and latest kernels. In our case the system is bootstrapped with a 3.5.0-something kernel (coming from the xorg-edgers repo) and then we downgrade the kernel to 3.2.0-76 (or thereabouts). dkms ignores the final kernel because its version is lower than that of the running kernel, so dkms installs the nvidia modules only for the current/latest kernel (3.5.0).

As a temporary solution we can remove 3.5.0 from the repos, but the same issue may hit us again in the future.

It'd probably be better to create a helper script which, on boot, checks that all available dkms modules (per `dkms status`) are built/installed for the current kernel.

The following patch addresses the issues above and I tested it on talos-linux32-ix-001.

Some highlights:

* Remove all obsolete kernels regardless of suffix (we install -generic on 32-bit first, then switch to -generic-pae)

* The script is invoked by upstart before x11 starts (on "starting x11") to make sure we have the modules before we start X

/var/log/upstart/nvidia-310.log contains all console output generated by the script.
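
As an aside, here is a minimal Python sketch of the check that helper performs: parse `dkms status` and flag modules that are not installed for the kernel we are actually running. The parsing matches the `dkms status` output quoted in this bug; the script itself is illustrative only (the real fix is the upstart job in the attached patch):

import platform
import subprocess

def modules_missing_for_running_kernel():
    running = platform.release()  # same value as `uname -r`
    status = subprocess.check_output(["dkms", "status"], universal_newlines=True)
    all_modules, ok_modules = set(), set()
    for line in status.splitlines():
        # e.g. "nvidia-310, 310.32, 3.2.0-76-generic-pae, i686: installed"
        fields = [f.strip() for f in line.replace(":", ",").split(",")]
        if len(fields) < 5:
            continue
        module, _version, kernel, _arch, state = fields[:5]
        all_modules.add(module)
        if kernel == running and state == "installed":
            ok_modules.add(module)
    return all_modules - ok_modules

if __name__ == "__main__":
    missing = modules_missing_for_running_kernel()
    if missing:
        print("dkms modules not built for %s: %s"
              % (platform.release(), ", ".join(sorted(missing))))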
Attachment #8648400 - Flags: review?(bugspam.Callek)
Comment on attachment 8648400 [details] [diff] [review]
nvidia_rebuild_modules.diff

Review of attachment 8648400 [details] [diff] [review]:
-----------------------------------------------------------------

::: modules/packages/manifests/kernel.pp
@@ +66,5 @@
>                  if $kernelrelease == $kernel_ver and ! empty($obsolete_kernel_list) {
> +                    $obsolete_kernels  = suffix( prefix( $obsolete_kernel_list, 'linux-image-' ), '-generic')
> +                    $obsolete_headers  = suffix( prefix( $obsolete_kernel_list, 'linux-headers-' ), '-generic' )
> +                    $obsolete_kernels_pae  = suffix( prefix( $obsolete_kernel_list, 'linux-image-' ), '-generic-pae')
> +                    $obsolete_headers_pae  = suffix( prefix( $obsolete_kernel_list, 'linux-headers-' ), '-generic-pae' )

Looks like we no longer need http://hg.mozilla.org/build/puppet/annotate/27f57f829847/modules/packages/manifests/kernel.pp#l47

::: modules/packages/templates/nvidia_dkms.conf.erb
@@ +8,5 @@
> +        dkms add -m nvidia-<%= @nvidia_version %> -v <%= @nvidia_full_version %> -k `uname -r`
> +        /usr/lib/dkms/dkms_autoinstaller start || true
> +        modprobe nvidia-<%= @nvidia_version %> || true
> +    fi
> +end script

I don't understand any of these commands (except the "am I installed" one), but I trust your testing here
Attachment #8648400 - Flags: review?(bugspam.Callek) → review+
Comment on attachment 8648400 [details] [diff] [review]
nvidia_rebuild_modules.diff

(In reply to Justin Wood (:Callek) from comment #54)
> Looks like we no longer need
> http://hg.mozilla.org/build/puppet/annotate/27f57f829847/modules/packages/
> manifests/kernel.pp#l47

I think we do; we still use the PAE kernel on 32-bit Linux.

> I don't understand any of these commands, (except the "am I installed" one),
> but I trust your testing here

Hehe. :) Thanks.

remote:   https://hg.mozilla.org/build/puppet/rev/b24bf2e5c400
remote:   https://hg.mozilla.org/build/puppet/rev/cdef239d9df5
Attachment #8648400 - Flags: checked-in+
re-imaging talos-linux32-ix-001 now...
looking much better now:

[root@talos-linux32-ix-001 ~]# dkms status
nvidia-310, 310.32, 3.2.0-76-generic-pae, i686: installed
v4l2loopback, 0.6.1, 3.2.0-76-generic-pae, i686: installed
[root@talos-linux32-ix-001 ~]# dpkg -l |grep linux-image |grep ^ii
ii  linux-image-3.2.0-76-generic-pae       3.2.0-76.111                                                            Linux kernel image for version 3.2.0 on 32 bit x86 SMP
ii  linux-image-generic-pae                3.2.0.76.90                                                             Generic Linux kernel image

I forced some tests to see the greens!
I see 3 green results in a row. \o/ Assuming it's fixed now.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
 \o/  \ /
  |    |
 / \  /o\
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard