Closed
Bug 1141416
Opened 9 years ago
Closed 9 years ago
Fix the slaves broken by talos's inability to deploy an update
Categories
(Infrastructure & Operations Graveyard :: CIDuty, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: philor, Unassigned)
References
Details
Attachments
(5 files)
217.17 KB, image/png
749.27 KB, image/png
746 bytes, patch — Callek: review+, rail: checked-in+
1.68 KB, patch — Callek: review+, rail: checked-in+
4.13 KB, patch — Callek: review+, rail: checked-in+
Not sure what an actual successful fix will look like: as I remember, the last time we added a new talos chunk and tried to fix the problem by reimaging the slaves that didn't pick up the new version of talos, we wound up with slaves running an even more broken version of talos.
Reporter — Updated•9 years ago:
Blocks: talos-linux32-ix-008
Blocks: talos-linux64-ix-003
Blocks: talos-linux64-ix-017
Blocks: talos-linux64-ix-092
Blocks: talos-linux64-ix-055
Blocks: talos-linux64-ix-118
Blocks: talos-linux64-ix-099
Blocks: talos-linux64-ix-004
Blocks: talos-linux64-ix-008
Blocks: talos-linux64-ix-027
Comment 1•9 years ago
Looks like bug 1112773 just needs to land; poked the bug.
Updated•9 years ago:
Blocks: talos-linux32-ix-026
Blocks: talos-linux32-ix-001
Reporter — Updated•9 years ago:
Blocks: t-snow-r4-0005
Blocks: t-snow-r4-0025
Blocks: t-snow-r4-0048
Blocks: talos-linux32-ix-022
Blocks: t-snow-r4-0070
Blocks: t-snow-r4-0063
Blocks: t-snow-r4-0074
Blocks: talos-linux64-ix-080
Comment 2•9 years ago
1112773 is resolved. I suspect this is now fixed as we *should* be using the cloned checkout of talos and not the talos module baked into the venv.
Comment 3•9 years ago
(In reply to Jordan Lund (:jlund) from comment #2)
> 1112773 is resolved. I suspect this is now fixed as we *should* be using the
> cloned checkout of talos and not the talos module baked into the venv.

I will go through the list of Linux slaves in this bug tomorrow and re-image them all. Whee!
Reporter — Updated•9 years ago:
Blocks: talos-linux64-ix-001
Comment 4•9 years ago
I've re-imaged talos-linux32-ix-00[1,3,8]. 003 failed the first job it took:
http://buildbot-master104.bb.releng.scl3.mozilla.com:8201/builders/Ubuntu%20HW%2012.04%20try%20talos%20g2/builds/392

I'm holding off re-imaging the rest until we have at least a few successful runs on this first batch.
Comment 5•9 years ago
(In reply to Chris Cooper [:coop] from comment #4)
> I've re-imaged talos-linux32-ix-00[1,3,8]. 003 failed the first job it took:

All of these are timing out after an hour without output while trying to run talos. :/
Comment 6•9 years ago
(In reply to Chris Cooper [:coop] from comment #5)
> All of these are timing out after an hour without output while trying to run
> talos. :/

For those with access, here are the three failed jobs:
https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?name=talos-linux32-ix-003

Here is what I think is happening:

1) The first job was a try job that used a pinned mozharness rev (321d9dcec7b2) that comes before my fix for this bug[1]. That means we populated the python venv with a talos installation, so we fail like before because it's using the venv, not the talos repo:

11:50:00 INFO - Calling ['/builds/slave/test/build/venv/bin/talos', '--noisy', # ... etc

2) Then, even though the next job used the current m-i mozharness pin with the required fix, we end up with a corrupt venv that has stale talos data. Essentially, we call PerfConfigurator.py/run_tests.py directly from the 'bad' python interpreter:

13:42:36 INFO - Calling ['/builds/slave/test/build/venv/bin/python', '/builds/slave/test/build/talos_repo/talos/PerfConfigurator.py', # etc

In which case, we would have to either clobber the venv before each job or wait until we are 'unlikely' to have someone push to try with a really old mozharness pin.

jmaher, does that sound about right to you?

[1] http://hg.mozilla.org/build/mozharness/rev/f4520ff7c234
Flags: needinfo?(jmaher)
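One of the two options above, clobbering the venv before each job, can be sketched as a small shell helper. This is an illustration only, not the actual mozharness code; the path and the detection heuristic are assumptions based on the log lines quoted in the comment above.

```shell
# Sketch: drop a virtualenv that still carries a stale talos package, so the
# next job rebuilds the venv and uses the cloned talos_repo instead.
# clobber_stale_venv is a hypothetical helper, not part of mozharness.
clobber_stale_venv() {
    venv="$1"
    # Heuristic: any talos* file under the venv means an old pinned
    # mozharness rev installed talos into it.
    if [ -d "$venv" ] && find "$venv" -name 'talos*' | grep -q .; then
        echo "stale talos found in $venv, clobbering"
        rm -rf "$venv"
    fi
}

# On a slave this would run before each job, e.g.:
# clobber_stale_venv /builds/slave/test/build/venv
```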
Comment 7•9 years ago
this sounds very plausible to me. Should we wait ~1 week and then give it a go?
Flags: needinfo?(jmaher)
Comment 8•9 years ago
(In reply to Joel Maher (:jmaher) from comment #7)
> this sounds very plausible to me. Should we wait ~1 week and then give it a
> go?

Does it affect the calculus here that some of the failures are happening on mozilla-inbound as well? e.g.
http://buildbot-master104.bb.releng.scl3.mozilla.com:8201/builders/Ubuntu%20HW%2012.04%20mozilla-inbound%20talos%20g1/builds/1436
Updated•9 years ago:
Blocks: talos-linux32-ix-003
Comment 9•9 years ago
I don't know how to view that link above or get more information. Do we still have talos slaves broken on recent inbound/fx-team/mozilla-central builds? If so, then we need to take one of those failing slaves out and investigate it in more detail.
Comment 10•9 years ago
(In reply to Joel Maher (:jmaher) from comment #9)
> I don't know how to view that link above or get more information.
>
> Do we still have talos slaves broken on recent
> inbound/fx-team/mozilla-central builds?
>
> If so, then we need to take one of those failing slaves out and investigate
> it in more detail.

Hrm, so the m-i job came after the old try job, and in comment 6 I was suggesting that they shared the same venv (/builds/slave/test/build/venv), which carried talos packages from the try job.
Comment 11•9 years ago
interesting- why isn't the venv updated?
Reporter — Updated•9 years ago:
Blocks: talos-linux64-ix-002
Comment 12•9 years ago
(In reply to Joel Maher (:jmaher) from comment #11)
> interesting- why isn't the venv updated?

I think it is updated with the new modules we added, but what I'm suggesting is that it will still have the 'talos' virtualenv module in it, and I suspected that when we call the 'cloned talos checkout' scripts from the venv, we end up using bits from the 'venv talos' package. That was my original guess, granted I'm not familiar with talos and the setup.

If we don't have any better ideas, I think it is worth just trying this again at the end of this week on one freshly imaged machine, in hopes that we don't have any try runs still using an old mozharness rev.

I can't track down a public link for the first job anymore (the old-mozharness try job); here is the second job (the new-mozharness one that uses cloned talos):
https://treeherder.mozilla.org/logviewer.html#?job_id=11105391&repo=mozilla-inbound
Reporter
Comment 13•9 years ago
We tried it with talos-linux64-ix-002 this morning, freshly reimaged after a loan. The first job it took was on mozilla-inbound:
http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-inbound-linux64/1435683703/mozilla-inbound_ubuntu64_hw_test-tp5o-bm105-tests1-linux-build1089.txt.gz
Comment 14•9 years ago
(In reply to Phil Ringnalda (:philor) from comment #13)
> We tried it with talos-linux64-ix-002 this morning, freshly reimaged after a
> loan, first job it took was on mozilla-inbound

Hrm, and it is using the talos repo. Looks like there is something else at play. Maybe the issue was never that talos wasn't updating..

10:31:54 INFO - Calling ['/builds/slave/test/build/venv/bin/python', '/builds/slave/test/build/talos_repo/talos/PerfConfigurator.py'

joel, can we set you (or someone you recommend) up with a freshly imaged slave again to poke around?
Flags: needinfo?(jmaher)
Reporter
Comment 15•9 years ago
There was a time when the issue was that talos wasn't updating: the ones from at least as recently as August 2014 were failing when they tried to run a newly added suite because, as far as the talos they were running was concerned, that suite did not exist. But it has been more than a year since we last put a Linux talos slave back in service, so practically anything could have rotted with the image in the meantime.
Comment 17•9 years ago
(In reply to Joel Maher (:jmaher) from comment #16)
> Please get me a loaner, and I will look at this

Grabbed talos-linux64-ix-002 for Joel. He may still need some help here to work through puppet issues if we can't find something with the harnesses.
Comment 18•9 years ago
I am not sure what to look for here. I am able to run tests successfully, even tests which are *brand new* and not defined in a production environment. With that said, looking at the logs, we seem to get PerfConfigurator.py and run_tests.py from the checkout, not the venv. Maybe we have an old venv talos sitting around? We need to use a python from a venv to access modules.
Comment 19•9 years ago
I looked at the log that philor pasted. It seems that we clobber and use the right thing for the venv. I get the odd feeling that we could see something through VNC. I would not mind having a look at a loaner.

10:31:55 INFO - DEBUG : command line: /builds/slave/test/build/application/firefox/firefox -profile /tmp/tmprFZCOQ/profile http://localhost/getInfo.html
command timed out: 3600 seconds without output
running ['/tools/buildbot/bin/python', 'scripts/scripts/talos_script.py', '--suite', 'tp5o', '--add-option', '--webServer,localhost', '--branch-name', 'Mozilla-Inbound-Non-PGO', '--system-bits', '64', '--cfg', 'talos/linux_config.py', '--download-symbols', 'ondemand', '--use-talos-json', '--blob-upload-branch', 'Mozilla-Inbound-Non-PGO'], attempting to kill process
killed by signal 9
program finished with exit code -1
elapsedTime=3751.827034
Comment 20•9 years ago
> I would not mind having a look at a loaner.

Thanks Armen. I'm sure Joel wouldn't mind you using his slave (https://bugzil.la/1181250, talos-linux64-ix-002) if he is not currently using it. Passwords are likely the same as your last loaner. I double-checked and you are still on the VPN loaner list.
Comment 21•9 years ago
> I would not mind having a look at a loaner.
>
hey armen, have you had a chance to look at this loaner?
Flags: needinfo?(armenzg)
Comment 22•9 years ago
Not yet. I will leave the NI in place for next week.
Comment 23•9 years ago
jlund, could you please have a look at a working machine and check the permissions of these files?

[cltbld@talos-linux64-ix-002 test]$ ls -l /usr/lib/mozilla/plugins/
total 364
-rw-r--r-- 1 root root   6048 Jul  9  2012 librhythmbox-itms-detection-plugin.so
-rw-r--r-- 1 root root 100720 Apr  2  2012 libtotem-cone-plugin.so
-rw-r--r-- 1 root root 105424 Apr  2  2012 libtotem-gmp-plugin.so
-rw-r--r-- 1 root root  72040 Apr  2  2012 libtotem-mully-plugin.so
-rw-r--r-- 1 root root  80560 Apr  2  2012 libtotem-narrowspace-plugin.so

I see this in the output of a talos run:

12:20:55 INFO - DEBUG : command line: /builds/slave/test/build/application/firefox/firefox -profile /tmp/tmpQZ2ppq/profile http://localhost/getInfo.html
12:20:56 INFO - LoadPlugin: failed to initialize shared library libXt.so [libXt.so: cannot open shared object file: No such file or directory]
12:20:56 INFO - LoadPlugin: failed to initialize shared library libXext.so [libXext.so: cannot open shared object file: No such file or directory]
12:20:56 INFO - LoadPlugin: failed to initialize shared library /usr/lib/mozilla/plugins/librhythmbox-itms-detection-plugin.so [/usr/lib/mozilla/plugins/librhythmbox-itms-detection-plugin.so: wrong ELF class: ELFCLASS64]
12:20:56 INFO - LoadPlugin: failed to initialize shared library /usr/lib/mozilla/plugins/libtotem-cone-plugin.so [/usr/lib/mozilla/plugins/libtotem-cone-plugin.so: wrong ELF class: ELFCLASS64]
12:20:56 INFO - LoadPlugin: failed to initialize shared library /usr/lib/mozilla/plugins/libtotem-mully-plugin.so [/usr/lib/mozilla/plugins/libtotem-mully-plugin.so: wrong ELF class: ELFCLASS64]
12:20:56 INFO - LoadPlugin: failed to initialize shared library /usr/lib/mozilla/plugins/libtotem-gmp-plugin.so [/usr/lib/mozilla/plugins/libtotem-gmp-plugin.so: wrong ELF class: ELFCLASS64]
12:20:56 INFO - LoadPlugin: failed to initialize shared library /usr/lib/mozilla/plugins/libtotem-narrowspace-plugin.so [/usr/lib/mozilla/plugins/libtotem-narrowspace-plugin.so: wrong ELF class: ELFCLASS64]
12:20:59 INFO - __metrics Screen width/height:1600/1200
12:20:59 INFO - colorDepth:24
12:20:59 INFO - Browser inner width/height: 1024/697
12:20:59 INFO - __metrics
12:20:59 INFO - JavaScript error: resource:///modules/WebappManager.jsm, line 48: NS_ERROR_FAILURE: Component returned failure code: 0x80004005 (NS_ERROR_FAILURE) [nsIObserverService.removeObserver]
12:21:01 INFO - DEBUG : initialized firefox
12:21:01 INFO - DEBUG : command line: /builds/slave/test/build/application/firefox/firefox -profile /tmp/tmpQZ2ppq/profile -tp file:/builds/slave/test/build/talos_repo/talos/page_load_test/canvasmark/canvasmark.manifest -tpchrome -tpnoisy -tpcycles 5 -tppagecycles 1
Flags: needinfo?(armenzg)
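The "wrong ELF class: ELFCLASS64" errors mean a 32-bit firefox is trying to dlopen 64-bit plugin libraries. One way to check a library's class without loading it is to read byte 4 (EI_CLASS) of the ELF header. This is a standalone sketch for illustration, not part of the talos harness:

```python
ELFCLASS32, ELFCLASS64 = 1, 2

def elf_class(path):
    """Return 32 or 64 for an ELF file, based on e_ident[EI_CLASS]."""
    with open(path, "rb") as f:
        ident = f.read(5)
    # The first four bytes of any ELF file are the magic \x7fELF.
    if len(ident) < 5 or ident[:4] != b"\x7fELF":
        raise ValueError("not an ELF file: %s" % path)
    cls = ident[4]
    if cls == ELFCLASS32:
        return 32
    if cls == ELFCLASS64:
        return 64
    raise ValueError("unknown ELF class %d" % cls)
```

A 32-bit firefox can only load class-32 plugins, so any file under /usr/lib/mozilla/plugins/ reporting 64 here would explain the LoadPlugin failures above.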
Comment 24•9 years ago
(In reply to Armen Zambrano Gasparnian [:armenzg] from comment #23)
> jlund, could you please have a look at a working machine and check for the
> permission of these files?

On talos-linux64-ix-085, I see the same thing:

[cltbld@talos-linux64-ix-085 ~]$ ls -l /usr/lib/mozilla/plugins/
total 364
-rw-r--r-- 1 root root   6048 Jul  9  2012 librhythmbox-itms-detection-plugin.so
-rw-r--r-- 1 root root 100720 Apr  2  2012 libtotem-cone-plugin.so
-rw-r--r-- 1 root root 105424 Apr  2  2012 libtotem-gmp-plugin.so
-rw-r--r-- 1 root root  72040 Apr  2  2012 libtotem-mully-plugin.so
-rw-r--r-- 1 root root  80560 Apr  2  2012 libtotem-narrowspace-plugin.so
Comment 25•9 years ago
I don't know if it is related. Just attaching the screenshot of one of the machines with issues. The bar on the left is not there.
Comment 26•9 years ago
We still have a bad slave running. Bad svg run:
http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/b2g-inbound-linux/1438718104/b2g-inbound_ubuntu32_hw_test-svgr-bm104-tests1-linux-build69.txt.gz

Here's a good svg run:
http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/b2g-inbound-linux/1438720319/b2g-inbound_ubuntu32_hw_test-svgr-bm105-tests1-linux-build226.txt.gz

From comparing the logs after some string manipulation (:%s/\d\d:\d\d:\d\d//g), the only difference I see on a bad run is this:

warning: s3-us-west-2.amazonaws.com certificate with fingerprint d2::be:33:1e:fa:6b:e2:9f:51:e1:8a:ab::64:e7 not verified (check hostfingerprints or web.cacerts config setting)

Nothing to go on there.
Comment 27•9 years ago
I believe that runner was still running (from /var/log/syslog) around the time that the job was running [1].

Some interesting symptoms:
* talos-linux32-003 was disabled on slavealloc yet taking jobs on production
* About 13:55:09 today it was running a buildbot job while runner was running [1]
* I could not stay connected to that machine for a while. Possibly runner kills the ssh server
* The desktop environment looks messed up (see attachment)
** Perhaps runner is killing it or restarting it

These symptoms are not seen on talos-linux64-ix-002, which jmaher was using, but it might have been in such a state before it was loaned. Perhaps the cleaning process for a loaner fixes the underlying issue.

Could someone please investigate talos-linux32-003 and determine what is going on? I would have tried to reboot the machine, but I would not know if I would be clearing up the issue. Do we have metrics for this host?

[1]
Aug 5 11:54:59 talos-linux32-ix-003 running post-task hook: /opt/runner/task_hook.py {"try_num": 1, "max_retries": 5, "task": "0-checkout_tools", "result": "OK"}
Aug 5 11:55:00 talos-linux32-ix-003 running pre-task hook: /opt/runner/task_hook.py {"try_num": 1, "max_retries": 5, "task": "0-config_hgrc", "result": "RUNNING"}
Aug 5 11:55:01 talos-linux32-ix-003 0-config_hgrc: starting (max time 600s)
Aug 5 11:55:02 talos-linux32-ix-003 0-config_hgrc: OK
Aug 5 11:55:02 talos-linux32-ix-003 running post-task hook: /opt/runner/task_hook.py {"try_num": 1, "max_retries": 5, "task": "0-config_hgrc", "result": "OK"}
Aug 5 11:55:03 talos-linux32-ix-003 running pre-task hook: /opt/runner/task_hook.py {"try_num": 1, "max_retries": 5, "task": "1-cleanslate", "result": "RUNNING"}
Aug 5 11:55:04 talos-linux32-ix-003 1-cleanslate: starting (max time 600s)
Aug 5 11:55:05 talos-linux32-ix-003 1-cleanslate: OK
Aug 5 11:55:05 talos-linux32-ix-003 running post-task hook: /opt/runner/task_hook.py {"try_num": 1, "max_retries": 5, "task": "1-cleanslate", "result": "OK"}
Aug 5 11:55:06 talos-linux32-ix-003 running pre-task hook: /opt/runner/task_hook.py {"try_num": 1, "max_retries": 5, "task": "1-cleanup", "result": "RUNNING"}
Aug 5 11:55:07 talos-linux32-ix-003 1-cleanup: starting (max time 600s)
Aug 5 11:55:08 talos-linux32-ix-003 1-cleanup: OK
Aug 5 11:55:08 talos-linux32-ix-003 running post-task hook: /opt/runner/task_hook.py {"try_num": 1, "max_retries": 5, "task": "1-cleanup", "result": "OK"}
Aug 5 11:55:09 talos-linux32-ix-003 running pre-task hook: /opt/runner/task_hook.py {"try_num": 1, "max_retries": 5, "task": "1-mig_agent", "result": "RUNNING"}
Aug 5 11:55:10 talos-linux32-ix-003 1-mig_agent: starting (max time 600s)
Aug 5 11:55:17 talos-linux32-ix-003 1-mig_agent: OK
Comment 28•9 years ago
(In reply to Armen Zambrano Gasparnian [:armenzg] from comment #27)
> * I could not stay connected to that machine for a while. Possible runner
> kills the ssh server
> * The desktop environment looks messed up (see attachment)
> ** Perhaps runner is killing it or restarting it
>
> These symptoms are not faced by talos-linux64-ix-002 which jmaher was using
> but it might have been in such state before it was loaned. Perhaps the
> cleaning process of a loaner fixes the underlying issue.

This is expected. buildbot is started via a runner task after it has completed all the pre-flight tasks. The list of tasks is removed as part of the loan, so jmaher wouldn't have seen this.

> Could someone please investigate talos-linux32-003 and determine what is
> going on?
> I would have tried to reboot the machine but I would not know if I would be
> clearing up the issue.

One of the other tasks is cleanslate. cleanslate tries to build a list of allowed running processes and actively terminates things that are not in that list. I think this is likely where this is falling down, i.e. I suspect that cleanslate is killing off things like ssh connections, VNC sessions, possibly even the graphics context required for the tests themselves.

> Do we have metrics for this host?

We have logs in papertrail, but not sure you have access:
https://papertrailapp.com/systems/talos-linux32-ix-003/events
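cleanslate's own code isn't quoted here, but the whitelist approach described above can be illustrated with a toy model (all names hypothetical): anything whose command name is not on the allowed list gets reaped, which is exactly why an overly small whitelist can take out sshd, a VNC server, or the window manager.

```python
# Toy model of a whitelist-based process reaper, in the spirit of the
# cleanslate behavior described above. Process data is passed in explicitly;
# a real tool would walk /proc and send signals. All names are hypothetical.

ALLOWED = {"init", "sshd", "Xorg", "buildbot", "python"}

def processes_to_kill(process_table, allowed=ALLOWED):
    """process_table: list of (pid, command_name) pairs.
    Return the pids of processes that are not whitelisted."""
    return [pid for pid, name in process_table if name not in allowed]
```

With a whitelist like the one above, a window manager ("unity") or VNC server ("x11vnc") would be selected for termination, matching the broken-desktop and dropped-connection symptoms in comment 27.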
Comment 29•9 years ago
I discussed this yesterday with Armen and Rail. From the screenshot it seems that the window manager (Unity) is not running, so it's either not starting, crashing, or being killed. I'm going to try connecting a talos-linux64 machine to my tests master in staging today and running it with cleanslate disabled to see whether we end up with a functional window manager.
Comment 30•9 years ago
I have talos-linux64-ix-004 running against my staging tests master with cleanslate disabled, and it's actually running the test instead of timing out. That's a pretty clear indication to me that cleanslate is being too aggressive. I'm going to dump the process list from talos-linux64-ix-004 running in this state and compare it against the process list of a machine running cleanslate to narrow down the processes we should be white-listing.
Comment 31•9 years ago
...and now I have no idea what's going on. Every single talos test I've run against my tests server has passed, regardless of whether cleanslate has been disabled or not. I'm going to try re-imaging talos-linux64-ix-004 to see whether I can recreate the standard bustage. I'm way too pessimistic to believe this has actually been fixed.
Comment 32•9 years ago
Does the window manager get fixed? If you VNC into a machine after your changes, can you see Firefox/talos running?
Comment 33•9 years ago
After a fresh re-install, the machine (talos-linux64-ix-004) will hang in talos as shown in the attached screenshot. This indicates to me that apache is not starting correctly, or the machine is being marked as ready for service *before* apache is properly installed.

On a hunch, I tried simply rebooting the machine. It took the next job and had no trouble finding localhost/getInfo.html. This indicates to me that the fix is likely in puppet, i.e. verifying that apache is properly installed before the final reboot that would return the slave to production, or at the very worst simply adding another reboot to that cycle.

I'm going to try to replicate this behavior on another slave before I dive into puppet.
Comment 34•9 years ago
(In reply to Chris Cooper [:coop] from comment #33)
> I'm going to try to replicate this behavior on another slave before I dive
> into puppet.

Was able to replicate on talos-linux64-ix-001 after a re-image, and verified that running "service apache2 restart" was enough to start serving content properly. Now, to figure out how to invoke an apache restart via puppet...
Comment 35•9 years ago
(In reply to Chris Cooper [:coop] from comment #34)
> Now, to figure out how to invoke an apache restart via puppet...

AFAICT, this should already be happening. Here's the relevant log snippet from the initial puppet run on talos-linux64-ix-001:

Aug 10 19:57:43 talos-linux64-ix-001 puppet-agent[941]: (/Stage[main]/Packages::Httpd/Package[apache2]/ensure) ensure changed 'purged' to 'latest'
Aug 10 19:57:43 talos-linux64-ix-001 puppet-agent[941]: (/Stage[main]/Talos/Httpd::Config[talos.conf]/File[/etc/apache2/sites-enabled/talos.conf]/ensure) defined content as '{md5}be879b3a62da323700e7fa3badeb9acb'
Aug 10 19:57:45 talos-linux64-ix-001 puppet-agent[941]: (/Stage[main]/Httpd/Service[httpd]) Triggered 'refresh' from 1 events

Maybe the refresh action on Ubuntu isn't doing what we expect?
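The notify_httpd.diff patch itself isn't quoted in this bug, but the standard puppet idiom for restarting a service when its config changes is a notify relationship from the config file to the service. A minimal sketch, with resource titles inferred from the log snippet above (assumptions, not the actual patch):

```puppet
# Sketch only: resource names are inferred from the puppet log above, not
# taken from notify_httpd.diff. When the config file changes, the notify
# relationship makes puppet send a refresh (restart/reload) to the service.
file { '/etc/apache2/sites-enabled/talos.conf':
    ensure => file,
    source => 'puppet:///modules/talos/talos.conf',
    notify => Service['httpd'],
}

service { 'httpd':
    ensure => running,
    enable => true,
}
```

A refresh only fires when the file resource actually changes in that puppet run, which is consistent with the "Triggered 'refresh' from 1 events" line in the log.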
Comment 36•9 years ago
While I work on the puppet fix, I'm going to start re-imaging all of the affected slaves in the blocker list. It's sufficient to simply reboot each slave an extra time after re-imaging and prior to putting them back into service, and since we have concern about the size of these pools (bug 1193025), we should get as many back into service as possible.
Updated•9 years ago:
No longer blocks: talos-linux32-ix-022
No longer blocks: talos-linux32-ix-026
No longer blocks: talos-linux32-ix-008
Blocks: talos-linux32-ix-026
Blocks: talos-linux32-ix-008
Blocks: talos-linux32-ix-022
Comment 37•9 years ago
The same fix was insufficient for the linux32 machines, so Ryan ended up disabling all the re-imaged linux32 slaves yesterday. Something is still wrong with the graphics on these machines: the tests won't start, and I can't connect via VNC like I can on the linux64 iX machines. Rail: you set these up originally, and generally know more about linux. Can you have a look to see if it's something obvious?
Blocks: 1193025
Comment 39•9 years ago
(In reply to Chris Cooper [:coop] from comment #38)
> Rail: see comment #37

talos-linux32-ix-001 is currently attached to my staging master if you need a machine to poke at:
http://dev-master2.bb.releng.use1.mozilla.com:8045/buildslaves/talos-linux32-ix-001
Comment 40•9 years ago
I suspect that the following should fix the issue: https://gist.github.com/rail/a10d90a520181b44f88a
Flags: needinfo?(rail)
Comment 41•9 years ago
Attachment #8648104 - Flags: review?(bugspam.Callek)
Comment 42•9 years ago
Comment on attachment 8648104 [details] [diff] [review]
notify_httpd.diff

Review of attachment 8648104 [details] [diff] [review]:
-----------------------------------------------------------------

I'd be surprised if this fixes linux32 as described (given it had extra restarts and such), but it is the right thing to do anyway.

Attachment #8648104 - Flags: review?(bugspam.Callek) → review+
Comment 43•9 years ago
I think this is what happens here:
1) puppet installs apache and starts it (ensure => running)
2) puppet deletes the file above, but apache is already running and has all the configs in memory
Comment 44•9 years ago
Comment on attachment 8648104 [details] [diff] [review]
notify_httpd.diff

remote: https://hg.mozilla.org/build/puppet/rev/5030c9f26096
remote: https://hg.mozilla.org/build/puppet/rev/8a884831b65d

Attachment #8648104 - Flags: checked-in+
Comment 45•9 years ago
Found these:

[root@talos-linux32-ix-001 init]# tail /var/log/upstart/x11.log
Current version of pixman: 0.28.2
Before reporting problems, check http://wiki.x.org to make sure that you have the latest version.
Markers: (--) probed, (**) from config file, (==) default setting, (++) from command line, (!!) notice, (II) informational, (WW) warning, (EE) error, (NI) not implemented, (??) unknown.
(==) Log file: "/var/log/Xorg.0.log", Time: Fri Aug 14 09:53:22 2015
(==) Using config file: "/etc/X11/xorg.conf"
(==) Using system config directory "/usr/share/X11/xorg.conf.d"
FATAL: Module nvidia_310 not found.

[root@talos-linux32-ix-001 init]# modprobe nvidia_310
FATAL: Module nvidia_310 not found.

The command above works fine on talos-linux32-ix-043.
Comment 46•9 years ago
[root@talos-linux32-ix-043 ~]# ls /lib/modules/`uname -r`/updates/dkms/nvidia_310.ko
/lib/modules/3.2.0-76-generic-pae/updates/dkms/nvidia_310.ko

[root@talos-linux32-ix-001 init]# ls /lib/modules/`uname -r`/updates/dkms/nvidia_310.ko
ls: cannot access /lib/modules/3.2.0-76-generic-pae/updates/dkms/nvidia_310.ko: No such file or directory
Comment 47•9 years ago
[root@talos-linux32-ix-043 ~]# dkms status nvidia-310
nvidia-310, 310.32, 3.2.0-76-generic-pae, i686: installed
nvidia-310, 310.32, 3.5.0-18-generic, i686: installed

[root@talos-linux32-ix-001 init]# dkms status nvidia-310
nvidia-310, 310.32, 3.5.0-18-generic, i686: installed

getting there :)
Comment 48•9 years ago
It looks like nvidia-310 needs to be installed after we downgrade the latest kernel (3.5.0-18) to 3.2.0-76-generic-pae. Otherwise the dkms package adds itself only to 3.5.0-18. Also, there is no need to keep 3.5.0-18 around.

Attachment #8648176 - Flags: review?(bugspam.Callek)
Comment 49•9 years ago
Comment on attachment 8648176 [details] [diff] [review]
nvidia_needs_kernel.diff

Sure, why not -- I trust :rail here

Attachment #8648176 - Flags: review?(bugspam.Callek) → review+
Comment 50•9 years ago
Comment on attachment 8648176 [details] [diff] [review]
nvidia_needs_kernel.diff

remote: https://hg.mozilla.org/build/puppet/rev/2ff9fe9a5e4f
remote: https://hg.mozilla.org/build/puppet/rev/92324b6e7493
Updated•9 years ago:
Attachment #8648176 - Flags: checked-in+
Comment 51•9 years ago
I'm re-imaging talos-linux32-ix-001 to take the fix.
Comment 52•9 years ago
(In reply to Chris Cooper (on PTO until Aug 31) [:coop] from comment #51)
> I'm re-imaging talos-linux32-ix-001 to take the fix.

No improvement, sadly.
Comment 53•9 years ago
I dug into the issue a little bit deeper yesterday.

tl;dr: dkms adds kernel modules only to the current and latest kernels. In our case the system is bootstrapped with a 3.5.0-something kernel (coming from the xorg-edgers repo) and then we downgrade the kernel to 3.2.0-76 (or something). dkms ignores the final kernel because its version is less than the version of the running kernel, so we have dkms-installed nvidia modules only for the current and latest (3.5.0) kernel.

As a temporary solution we could remove 3.5.0 from the repos, but the same issue may hit us again in the future. It'd probably be better to create a helper script which, on boot, checks whether all available (dkms status) dkms modules are built/installed for the current kernel.

The following patch addresses the issues above, and I tested it on talos-linux32-ix-001. Some highlights:

* remove all obsoleted kernel types regardless of suffix (we install -generic on 32-bit first, then switch to generic-pae)
* the script is invoked by upstart before x11 starts (on starting x11) to make sure we have modules before we start X
* /var/log/upstart/nvidia-310.log contains all console output generated by the script

Attachment #8648400 - Flags: review?(bugspam.Callek)
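The dkms behavior described above — only building modules for kernels at least as new as the running one — can be modeled in a few lines. This is a simplification for illustration only; dkms's real selection logic is more involved, and the version-parsing helper is hypothetical:

```python
def parse_kernel(ver):
    """'3.2.0-76-generic-pae' -> (3, 2, 0, 76), a comparable version tuple.
    Hypothetical helper; assumes Ubuntu-style '-generic[-pae]' suffixes."""
    release, _, _flavour = ver.partition("-generic")
    base, _, build = release.partition("-")
    return tuple(int(x) for x in base.split(".")) + (int(build or 0),)

def kernels_dkms_builds_for(running, installed):
    """Simplified model of the dkms behavior described above: any installed
    kernel older than the running one is skipped, which is why a downgraded
    kernel never gets the nvidia module."""
    return [k for k in installed if parse_kernel(k) >= parse_kernel(running)]
```

Under this model, with 3.5.0-18 running at bootstrap time, the downgraded 3.2.0-76-generic-pae kernel is skipped, matching the `dkms status` output from comment 47.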
Comment 54•9 years ago
Comment on attachment 8648400 [details] [diff] [review]
nvidia_rebuild_modules.diff

Review of attachment 8648400 [details] [diff] [review]:
-----------------------------------------------------------------

::: modules/packages/manifests/kernel.pp
@@ +66,5 @@
> if $kernelrelease == $kernel_ver and ! empty($obsolete_kernel_list) {
> +    $obsolete_kernels = suffix( prefix( $obsolete_kernel_list, 'linux-image-' ), '-generic')
> +    $obsolete_headers = suffix( prefix( $obsolete_kernel_list, 'linux-headers-' ), '-generic' )
> +    $obsolete_kernels_pae = suffix( prefix( $obsolete_kernel_list, 'linux-image-' ), '-generic-pae')
> +    $obsolete_headers_pae = suffix( prefix( $obsolete_kernel_list, 'linux-headers-' ), '-generic-pae' )

Looks like we no longer need http://hg.mozilla.org/build/puppet/annotate/27f57f829847/modules/packages/manifests/kernel.pp#l47

::: modules/packages/templates/nvidia_dkms.conf.erb
@@ +8,5 @@
> +    dkms add -m nvidia-<%= @nvidia_version %> -v <%= @nvidia_full_version %> -k `uname -r`
> +    /usr/lib/dkms/dkms_autoinstaller start || true
> +    modprobe nvidia-<%= @nvidia_version %> || true
> +  fi
> +end script

I don't understand any of these commands (except the "am I installed" one), but I trust your testing here.

Attachment #8648400 - Flags: review?(bugspam.Callek) → review+
Comment 55•9 years ago
Comment on attachment 8648400 [details] [diff] [review]
nvidia_rebuild_modules.diff

(In reply to Justin Wood (:Callek) from comment #54)
> Looks like we no longer need
> http://hg.mozilla.org/build/puppet/annotate/27f57f829847/modules/packages/
> manifests/kernel.pp#l47

I think we do, we still use the PAE kernel on 32-bit linux.

> I don't understand any of these commands, (except the "am I installed" one),
> but I trust your testing here

Hehe. :) Thanks.

remote: https://hg.mozilla.org/build/puppet/rev/b24bf2e5c400
remote: https://hg.mozilla.org/build/puppet/rev/cdef239d9df5

Attachment #8648400 - Flags: checked-in+
Comment 56•9 years ago
re-imaging talos-linux32-ix-001 now...
Comment 57•9 years ago
Looking much better now:

[root@talos-linux32-ix-001 ~]# dkms status
nvidia-310, 310.32, 3.2.0-76-generic-pae, i686: installed
v4l2loopback, 0.6.1, 3.2.0-76-generic-pae, i686: installed

[root@talos-linux32-ix-001 ~]# dpkg -l |grep linux-image |grep ^ii
ii linux-image-3.2.0-76-generic-pae 3.2.0-76.111 Linux kernel image for version 3.2.0 on 32 bit x86 SMP
ii linux-image-generic-pae 3.2.0.76.90 Generic Linux kernel image

I forced some tests to see the greens!
Comment 58•9 years ago
I see 3 green results in a row. \o/ Assuming it's fixed now.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Comment 59•9 years ago
\o/ \ / | | / \ /o\
Updated•6 years ago:
Product: Release Engineering → Infrastructure & Operations

Updated•4 years ago:
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard