Closed
Bug 1465059
Opened 6 years ago
Closed 6 years ago
Slave loan request for Win7-32 debug to ccorcoran
Categories
(Infrastructure & Operations Graveyard :: CIDuty, task, P1)
Tracking
(Not tracked)
RESOLVED
INVALID
People
(Reporter: ccorcoran, Unassigned)
Details
(Whiteboard: [ciduty][capacity][buildslaves][loaner])
Hello, I would like to loan a Win7 x86 debug slave.
Reason for the loan: Bug 1435827, intermittent oranges in try run: https://treeherder.mozilla.org/#/jobs?repo=try&revision=d906d169badab652398b5a0340778b975cd3ea4a
Expected time for the loan: 1 week
Thank you.
Comment 1•6 years ago
Hi Carl, I would start this by loaning a gecko-t-win7-32 machine but I just wanted to let you know that you may need a gecko-t-win7-32-gpu because the reftests are done on the gecko-t-win7-32-gpu.
Flags: needinfo?(ccorcoran)
Comment 2•6 years ago
I got the response via IRC: he needs a gecko-t-win7-32-gpu. I will try to satisfy your request ASAP.
Flags: needinfo?(ccorcoran)
Comment 3•6 years ago
see: https://wiki.mozilla.org/ReleaseEngineering/How_To/Self_Provision_a_TaskCluster_Windows_Instance#For_generic-worker_10.5.0_onwards
Reporter | ||
Comment 4•6 years ago
Rob this is very helpful; I'll give it a go. Thanks!
Comment 5•6 years ago
Hello Carl, please let us know if everything went well with the self-provisioning and whether you are able to work on the Windows instance. Thank you.
Flags: needinfo?(ccorcoran)
Comment 6•6 years ago
(In reply to Rob Thijssen (:grenade UTC+2) from comment #3)
> see:
> https://wiki.mozilla.org/ReleaseEngineering/How_To/Self_Provision_a_TaskCluster_Windows_Instance#For_generic-worker_10.5.0_onwards

Thanks Rob.

@ciduty - please update the docs on our loaning instructions to reference and point to the one-click loaning procedure outlined here. The end goal would be to make it clear and explicit when ciduty needs to assist in loaning and when this is self-servable by devs. This way we don't need to ask relops or tc again.
Flags: needinfo?(ciduty)
Reporter | ||
Comment 7•6 years ago
Adrian, the provisioning worked fine though it ended abruptly after about 6 hours, presumably being decommissioned as the doc warns. It wasn't enough time for me to complete my digging around, but I'll self-provision another tomorrow. The procedure is super helpful though; thanks!
Flags: needinfo?(ccorcoran)
Comment 8•6 years ago
We will close the bug for now as the main request is fulfilled. If there are any issues, please re-open the bug.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Reporter | ||
Comment 9•6 years ago
Update: My attempt to self-provision today did not work as it did yesterday.

https://tools.taskcluster.net/groups/ZWKmfiu1R02IwDFDa6erJA/tasks/ZWKmfiu1R02IwDFDa6erJA/runs/0/artifacts

The IP address in the rdpinfo.txt is not accessible. Is this a matter of VPN (which I am currently in the process of getting set up)?

I would be quite happy at this point to get a "real" loaner that will survive longer than a few hours as well, as in the bug description. Possible?
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 10•6 years ago
Hello Carl,

To my knowledge, the TaskCluster Windows machines are self-served. As those servers expire after 30 minutes of idle/inactivity time, we can't reliably provide them for you.

Regarding your VPN connection, see this page [1]. I hope that it will be useful.

[1] - https://mana.mozilla.org/wiki/display/IT/Mozilla+VPN

I'm still looking into how I can help you more.
Reporter | ||
Comment 11•6 years ago
Is it not possible to request a loaner that will live for a fixed amount of time? 30 minutes is very restrictive. I need to know my work will not be lost if I have a meeting or lunch break. Regarding VPN, my query is that I cannot access the IP address given in rdpinfo.txt from my local machine (I am a remote worker). Yesterday I had no problem connecting to the self-provisioned loaner without VPN, so today's result confused me. I'm asking if I need VPN for this one? Or why else would the IP be inaccessible from my machine?
Flags: needinfo?(riman)
Comment 12•6 years ago
I have tried to provide a gecko-t-win7-32-gpu loaner following the steps from https://wiki.mozilla.org/ReleaseEngineering/How_To/Loan_a_Slave#Windows_7_AWS_machines

I encountered a problem when I tried to find an IP address:
" -bash: cloud-tools/scripts/free_ips.py: Permission denied "

@Zsolt I will assign this task to you. Please continue to figure out how to loan this machine. Thank you.
Flags: needinfo?(zfay)
Flags: needinfo?(riman)
Flags: needinfo?(jlund)
Flags: needinfo?(ciduty)
Comment 13•6 years ago
(In reply to Radu Iman[:riman] from comment #12)
> I have tried to provide a gecko-t-win7-32-gpu loaner following the steps from
> https://wiki.mozilla.org/ReleaseEngineering/How_To/Loan_a_Slave#Windows_7_AWS_machines

I believe this is from Buildbot. We should figure out where these docs apply and whether we should remove them entirely.

(In reply to Carl Corcoran [:ccorcoran] from comment #11)
> Is it not possible to request a loaner that will live for a fixed amount of time? 30 minutes is very restrictive. I need to know my work will not be lost if I have a meeting or lunch break.

Taskcluster based AWS instances are provisioned differently than they were in Buildbot. I don't believe we have a process for creating a *new* instance that is based on production but is long lived. Currently we take an already existing instance out of the pool but, as it is spot based and subject to the same efficiency-designed constraints as the rest of production, it will terminate if inactive. Rob (from relops) can confirm.

Rob, when you get the opportunity, can you work with whoever is on shift in #ci and figure out what our options are for loaning given the ask in comment 11 and our out-of-date docs from comment 12?

re: comment 6, I want ciduty to make an exhaustive list of docs, separated by worker type, so that they and devs know what is and isn't possible. The end goal is to not have relops or taskcluster teams involved at all and to be disruption free.
Flags: needinfo?(jlund) → needinfo?(rthijssen)
Comment 14•6 years ago
(In reply to Radu Iman[:riman] from comment #12)
> I have encountered a problem when I tried to find an IP address. " -bash: cloud-tools/scripts/free_ips.py: Permission denied "
> @Zsolt I will assign this task to you. Please continue to figure out how to loan this machine. Thank you.

As Jake mentioned, free_ips.py uses invtool, which is a tool from the decommissioned Inventory service, so that won't work. If the rest of the steps work and apply to the same workers that Carl wants to loan from production (gecko-t-win7-32-gpu and gecko-t-win7-32), we can skip the free_ips.py step (update docs) and instead point devs to the public IP address of the worker, rather than a custom DNS record.
Comment 15•6 years ago
(In reply to Jordan Lund (:jlund) from comment #13)
> rob, when you get the opportunity, can you work with whoever is on shift in #ci and figure out what our options are for loaning given the ask in comment 11 and our out of date docs from comment 12?
>
> re: from comment 6, I want ciduty to make an exhaustive list of docs, separated by worker type so that they and devs know what is and isn't possible. End goal is to not have relops or taskcluster teams involved at all and disrupt free.

s/rob/jake/ given timezones and since Jake has recent context.
Flags: needinfo?(rthijssen) → needinfo?(jwatkins)
Updated•6 years ago
Flags: needinfo?(zfay)
Comment 16•6 years ago
Generic Worker took ownership of the loaner creation process, so pmoore will be better placed to answer questions about its mechanics.

We had some workarounds in the old occ process for keeping machines longer by killing the HaltOnIdle process and using admin access, but I don't think they will work now and I'm not sure how to do this under the new setup. I'm also not sure if keeping the RDP session alive will also keep the machine alive. It used to, but again, I'm not familiar with the new mechanics.

I can envisage that for special use cases, we could spin up a dedicated (non-spot) instance using the same AMIs as are used in production, but this takes a bit of coordination between the user and someone with access to the tc ec2 account and familiarity with ec2/ami instance instantiation, disabling gw, setting credentials, etc. Let's first see if the new loaner mechanism can meet this user's needs, as it's a fair bit of work to work around it.
Flags: needinfo?(jwatkins) → needinfo?(pmoore)
Comment 17•6 years ago
(In reply to Carl Corcoran [:ccorcoran] from comment #9)
> Update: My attempt to self-provision today did not work as it did yesterday.
>
> https://tools.taskcluster.net/groups/ZWKmfiu1R02IwDFDa6erJA/tasks/ZWKmfiu1R02IwDFDa6erJA/runs/0/artifacts
>
> The IP address in the rdpinfo.txt is not accessible. Is this a matter of VPN (which I am currently in the process of getting set up)?

You shouldn't need VPN access for this - it could be a problem with RDP access for worker type gecko-t-win7-32-gpu - was your previous successful loaner also for gecko-t-win7-32-gpu? I'm wondering if there is a problem with the security groups of this worker type.

> I would be quite happy at this point to get a "real" loaner that will survive longer than a few hours as well, as in the bug description. Possible?

Is 12 hours enough? That is how long they currently live for. If not, the wiki page from comment 3 explains how to get admin access, and using this, you can kill the generic-worker process, in which case your instance will live for up to 96 hours (although spot instances can be terminated by AWS at any time, unfortunately).
Flags: needinfo?(pmoore)
Comment 18•6 years ago
(In reply to Radu Iman[:riman] from comment #10)
> Hello Carl,
>
> To my knowledge, the TaskCluster Windows machines are self-served. As those servers expire after 30 minutes of idle/inactivity time we can't reliably provide them for you.

This is incorrect - please see the wiki page from comment 3.
Comment 19•6 years ago
(In reply to Radu Iman[:riman] from comment #12)
> I have tried to provide a gecko-t-win7-32-gpu loaner following the steps from
> https://wiki.mozilla.org/ReleaseEngineering/How_To/Loan_a_Slave#Windows_7_AWS_machines
> I have encountered a problem when I tried to find an IP address. " -bash: cloud-tools/scripts/free_ips.py: Permission denied "
> @Zsolt I will assign this task to you. Please continue to figure out how to loan this machine. Thank you.

This is the wrong wiki page - see comment 3.
Reporter | ||
Comment 20•6 years ago
I had indeed been previously successful in provisioning the loaner. Here's a full account of my experience in the last week of self-provisioning:

1. Followed directions and successfully connected to a gecko-t-win7-32 machine, but when I ran "Z:\task_1527672877\command_000000_wrapper.bat", I got errors (https://pastebin.mozilla.org/9086926). As I tried to understand why my task wouldn't run, I lost connection with the machine. Never did understand why I got that error.
2. Tried the same instructions again but the task failed to parse. I assume I must have made a typo.
3. Tried again, this time on a GPU machine. rdpinfo.txt was generated but I was unable to connect to the IP address.

I'll try again today. 12 hours should be enough. I'll report back here how it goes.
Reporter | ||
Comment 21•6 years ago
Results today: I successfully self-provisioned 3 machines via these respective tasks:
- https://tools.taskcluster.net/groups/FcHc-GamTfOBhIKR44gBHQ/tasks/FcHc-GamTfOBhIKR44gBHQ/runs/0/artifacts
- https://tools.taskcluster.net/groups/JeUUi2SORhCLlhbiaLMsnQ/tasks/JeUUi2SORhCLlhbiaLMsnQ/runs/0/artifacts
- https://tools.taskcluster.net/groups/QRYVPx8hREuyC6fL3q5Ncw/tasks/QRYVPx8hREuyC6fL3q5Ncw/runs/0/artifacts

No connection issues, however I'm unable to run my tasks on them. Steps I'm taking:
1. open up cmd.exe (I made certain the original task run had completed)
2. cd /d Z:\task_1528118652\
3. run command_000000_wrapper.bat

In the resulting error log, the following exception:
> 14:00:41 ERROR - Return code: 1
> 14:00:41 ERROR - 1 not in success codes: [0]
> 14:00:41 FATAL - Uncaught exception: Traceback (most recent call last):
> 14:00:41 FATAL - File "Z:\task_1528118652\mozharness\mozharness\base\script.py", line 2080, in run
> 14:00:41 FATAL - self.run_action(action)
> 14:00:41 FATAL - File "Z:\task_1528118652\mozharness\mozharness\base\script.py", line 2018, in run_action
> 14:00:41 FATAL - self._possibly_run_method("preflight_%s" % method_name)
> 14:00:41 FATAL - File "Z:\task_1528118652\mozharness\mozharness\base\script.py", line 1959, in _possibly_run_method
> 14:00:41 FATAL - return getattr(self, method_name)()
> 14:00:41 FATAL - File "Z:\task_1528118652\mozharness\mozharness\mozilla\testing\testbase.py", line 733, in preflight_run_tests
> 14:00:41 FATAL - self._run_cmd_checks(c.get('preflight_run_cmd_suites', []))
> 14:00:41 FATAL - File "Z:\task_1528118652\mozharness\mozharness\mozilla\testing\testbase.py", line 727, in _run_cmd_checks
> 14:00:41 FATAL - fatal_exit_code=suite.get('fatal_exit_code', 3))
> 14:00:41 FATAL - File "Z:\task_1528118652\mozharness\mozharness\base\script.py", line 1465, in run_command
> 14:00:41 FATAL - self.fatal("Halting on failure while running %s" % command,
> 14:00:41 FATAL - TypeError: not all arguments converted during string formatting
> 14:00:41 FATAL - Running post_fatal callback...
> 14:00:41 FATAL - Exiting -1

Here's a task I created in the same way, but this time I didn't run the command_000000_wrapper.bat manually.
- https://tools.taskcluster.net/groups/Xs00GOpZSluO-urR-4BNpQ/tasks/Xs00GOpZSluO-urR-4BNpQ

Here, the task runs as expected. I suspect now if I run command_000000_wrapper.bat on that machine, I'll get the exception above instead of the task running.

Am I missing another step before running command_000000_wrapper?
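A note on the TypeError at the bottom of that traceback: it is raised while mozharness is building its own failure message, so it hides the underlying reason the command failed. The sketch below reproduces the same Python error in isolation. It assumes `command` was a tuple at that point (a list would format without error); that shape, the helper names, and the sample values are illustrative assumptions, not taken from the mozharness source.

```python
# Hypothetical stand-ins for the failing call; not the actual mozharness code.
def format_halt_message(command):
    # Mirrors the pattern in the traceback: "%s" applied directly to command.
    # If command is a tuple, Python treats each element as a separate argument.
    return "Halting on failure while running %s" % command


def format_halt_message_safely(command):
    # Wrapping the value in a one-element tuple keeps it a single argument.
    return "Halting on failure while running %s" % (command,)


# Assumed command shape: a tuple of arguments (illustrative values only).
command = ("python.exe", "mouse_and_screen_resolution.py")

try:
    format_halt_message(command)
except TypeError as exc:
    error_text = str(exc)
else:
    error_text = None

print(error_text)
print(format_halt_message_safely(command))
```

The practical consequence is that the FATAL lines say nothing about the real failure; the lines logged before the traceback are where to look.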
Flags: needinfo?(pmoore)
Comment 23•6 years ago
I get the same if I don't connect with the screen resolution 1280x1024:

https://taskcluster-artifacts.net/Eij_hZaDSaqCr4CSFXbjQw/0/public/logs/log_error.log

If I open the info log instead:

https://taskcluster-artifacts.net/Eij_hZaDSaqCr4CSFXbjQw/0/public/logs/log_info.log

then I see the following at the end:

09:46:11 INFO - Copy/paste: c:\mozilla-build\python\python.exe Z:\task_1530003585\mozharness\external_tools\mouse_and_screen_resolution.py --configuration-file Z:\task_1530003585\mozharness\external_tools\machine-configuration.json
09:46:14 INFO - Screen resolution (current): (2560, 1440)
09:46:14 INFO - Changing the screen resolution...
09:46:14 INFO - Screen resolution (new): (2560, 1440)
09:46:14 INFO - Mouse position (current): (1445, 825)
09:46:14 INFO - Mouse position (new): (1010, 10)
09:46:14 INFO - INFRA-ERROR: The new screen resolution or mouse positions are not what we expected
09:46:14 ERROR - Return code: 1
09:46:14 ERROR - 1 not in success codes: [0]
09:46:14 WARNING - setting return code to 3
09:46:14 INFO - Running post-action listener: _package_coverage_data
09:46:14 INFO - Running post-action listener: _resource_record_post_action
09:46:14 INFO - [mozharness: 2018-06-26 09:46:14.985000Z] Finished run-tests step (failed)
09:46:14 FATAL - Uncaught exception: Traceback (most recent call last):
09:46:14 FATAL - File "Z:\task_1530003585\mozharness\mozharness\base\script.py", line 2080, in run
09:46:14 FATAL - self.run_action(action)
09:46:14 FATAL - File "Z:\task_1530003585\mozharness\mozharness\base\script.py", line 2018, in run_action
09:46:14 FATAL - self._possibly_run_method("preflight_%s" % method_name)
09:46:14 FATAL - File "Z:\task_1530003585\mozharness\mozharness\base\script.py", line 1959, in _possibly_run_method
09:46:14 FATAL - return getattr(self, method_name)()
09:46:14 FATAL - File "Z:\task_1530003585\mozharness\mozharness\mozilla\testing\testbase.py", line 736, in preflight_run_tests
09:46:14 FATAL - self._run_cmd_checks(c.get('preflight_run_cmd_suites', []))
09:46:14 FATAL - File "Z:\task_1530003585\mozharness\mozharness\mozilla\testing\testbase.py", line 730, in _run_cmd_checks
09:46:14 FATAL - fatal_exit_code=suite.get('fatal_exit_code', 3))
09:46:14 FATAL - File "Z:\task_1530003585\mozharness\mozharness\base\script.py", line 1465, in run_command
09:46:14 FATAL - self.fatal("Halting on failure while running %s" % command,
09:46:14 FATAL - TypeError: not all arguments converted during string formatting
09:46:14 FATAL - Running post_fatal callback...
09:46:14 FATAL - Exiting -1
09:46:14 INFO - Running post-run listener: _resource_record_post_run
09:46:15 INFO - Total resource usage - Wall time: 5s; CPU: 6.0%; Read bytes: 0; Write bytes: 145659904; Read time: 0; Write time: 2
09:46:15 INFO - TinderboxPrint: CPU usage<br/>5.9%
09:46:15 INFO - TinderboxPrint: I/O read bytes / time<br/>0 / 0
09:46:15 INFO - TinderboxPrint: I/O write bytes / time<br/>145,659,904 / 2
09:46:15 INFO - TinderboxPrint: CPU idle<br/>38.0 (94.1%)
09:46:15 INFO - TinderboxPrint: CPU system<br/>0.7 (1.8%)
09:46:15 INFO - TinderboxPrint: CPU user<br/>1.6 (4.1%)
09:46:15 INFO - pull - Wall time: 0s; CPU: Can't collect data; Read bytes: 0; Write bytes: 0; Read time: 0; Write time: 0
09:46:15 INFO - install - Wall time: 3s; CPU: 9.0%; Read bytes: 0; Write bytes: 77539840; Read time: 0; Write time: 1
09:46:15 INFO - run-tests - Wall time: 3s; CPU: 1.0%; Read bytes: 0; Write bytes: 0; Read time: 0; Write time: 0
09:46:15 INFO - Running post-run listener: copy_logs_to_upload_dir
09:46:15 INFO - Copying logs to upload dir...
09:46:15 INFO - mkdir: Z:\task_1530003585\build\upload\logs

Do you also see the following line in your log_info.log?

INFO - INFRA-ERROR: The new screen resolution or mouse positions are not what we expected

See point 8) of https://wiki.mozilla.org/ReleaseEngineering/How_To/Self_Provision_a_TaskCluster_Windows_Instance#For_generic-worker_10.5.0_onwards:

> 8) Connect with screen resolution 1280x1024 ! Note, it is important to use this resolution for gecko tests, since this is the screen size used by the tests, and the screen size cannot change once you have made a connection.

If the problem isn't this, please let me know. Thanks!
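For anyone hitting the same failure, the check described above can be scripted against a downloaded log_info.log. This is only a rough sketch: the function name `diagnose_log` and the regular expression are made up for illustration, and the line formats are assumed from the log excerpt in this comment rather than from any documented mozharness output contract.

```python
import re

# Resolution that point 8) of the wiki page asks for.
EXPECTED_RESOLUTION = (1280, 1024)

# Matches lines such as "Screen resolution (new): (2560, 1440)".
# The format is assumed from the excerpt above.
RESOLUTION_RE = re.compile(r"Screen resolution \(new\): \((\d+), (\d+)\)")


def diagnose_log(log_text):
    """Return (infra_error_seen, new_resolution) for a log_info.log body."""
    infra_error_seen = "INFRA-ERROR" in log_text
    match = RESOLUTION_RE.search(log_text)
    new_resolution = (int(match.group(1)), int(match.group(2))) if match else None
    return infra_error_seen, new_resolution


# Sample lines copied from the log excerpt in this comment.
sample = """\
09:46:14 INFO - Screen resolution (current): (2560, 1440)
09:46:14 INFO - Changing the screen resolution...
09:46:14 INFO - Screen resolution (new): (2560, 1440)
09:46:14 INFO - INFRA-ERROR: The new screen resolution or mouse positions are not what we expected
"""

infra_error, resolution = diagnose_log(sample)
if infra_error and resolution != EXPECTED_RESOLUTION:
    # Tuple concatenation supplies all four values to the format string.
    print("RDP session resolution is %sx%s, expected %sx%s"
          % (resolution + EXPECTED_RESOLUTION))
```

If the INFRA-ERROR line is present and the reported resolution is not 1280x1024, reconnecting over RDP at that resolution on a fresh instance is the fix the wiki describes.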
Flags: needinfo?(ccorcoran)
Reporter | ||
Comment 24•6 years ago
It looks like you're right! Even though this point is emphasized in the docs, I still didn't connect it to the errors. Next time I will definitely look closer at the INFO log. This is resolved for me.
Flags: needinfo?(ccorcoran)
Comment 25•6 years ago
No worries, glad it is working for you! :-)
Status: REOPENED → RESOLVED
Closed: 6 years ago → 6 years ago
Resolution: --- → INVALID
Updated•4 years ago
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard