Closed Bug 1465059 Opened 6 years ago Closed 6 years ago

Slave loan request for Win7-32 debug to ccorcoran

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task, P1)

Tracking

(Not tracked)

RESOLVED INVALID

People

(Reporter: ccorcoran, Unassigned)

Details

(Whiteboard: [ciduty][capacity][buildslaves][loaner])

Hello,

I would like to loan a Win7 x86 debug slave.
Reason for the loan: Bug 1435827 intermittent oranges in try run: https://treeherder.mozilla.org/#/jobs?repo=try&revision=d906d169badab652398b5a0340778b975cd3ea4a
Expected time for the loan: 1 week

Thank you.
Hi Carl,

I would start this by loaning a gecko-t-win7-32 machine, but I just wanted to let you know that you may need a gecko-t-win7-32-gpu, since the reftests run on gecko-t-win7-32-gpu.
Flags: needinfo?(ccorcoran)
The response came via IRC: he needs a gecko-t-win7-32-gpu.
I will try to satisfy your request ASAP.
Flags: needinfo?(ccorcoran)
Rob this is very helpful; I'll give it a go. Thanks!
Hello Carl,

Please let us know if everything went well with the self-provisioning and whether you are able to work with the Windows instance.

Thank you.
Flags: needinfo?(ccorcoran)
(In reply to Rob Thijssen (:grenade UTC+2) from comment #3)
> see:
> 
> https://wiki.mozilla.org/ReleaseEngineering/How_To/Self_Provision_a_TaskCluster_Windows_Instance#For_generic-worker_10.5.0_onwards

Thanks Rob.

@ciduty - please update the docs on our loaning instructions to reference the one-click loaning procedure outlined here. The end goal would be to make it clear and explicit when CiDuty needs to assist in loaning and when this is self-servable by devs. That way we don't need to ask relops or TC again.
Flags: needinfo?(ciduty)
Adrian, the provisioning worked fine, though it ended abruptly after about 6 hours, presumably decommissioned as the doc warns. It wasn't enough time for me to complete my digging around, but I'll self-provision another tomorrow.

The procedure is super helpful though; thanks!
Flags: needinfo?(ccorcoran)
We will close the bug for now, as the main request is fulfilled.
If there are any issues, please re-open the bug.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Update: My attempt to self-provision today did not work as it did yesterday.

https://tools.taskcluster.net/groups/ZWKmfiu1R02IwDFDa6erJA/tasks/ZWKmfiu1R02IwDFDa6erJA/runs/0/artifacts

The IP address in the rdpinfo.txt is not accessible. Is this a matter of VPN (which I am currently in the process of getting set up)?

I would be quite happy at this point to also get a "real" loaner that survives longer than a few hours, as in the bug description. Possible?
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Hello Carl,

To my knowledge, the TaskCluster Windows machines are self-served. As those servers expire after 30 minutes of idle time, we can't reliably provide them for you.

Regarding your VPN connection, see this page [1]. I hope it will be useful.

[1] - https://mana.mozilla.org/wiki/display/IT/Mozilla+VPN

I'm still looking into how I can help you further.
Is it not possible to request a loaner that will live for a fixed amount of time? 30 minutes is very restrictive. I need to know my work will not be lost if I have a meeting or lunch break.

Regarding VPN: my issue is that I cannot access the IP address given in rdpinfo.txt from my local machine (I am a remote worker). Yesterday I had no problem connecting to the self-provisioned loaner without VPN, so today's result confused me. Do I need VPN for this one? If not, why else would the IP be inaccessible from my machine?
Flags: needinfo?(riman)
I have tried to provision a gecko-t-win7-32-gpu loaner following the steps from https://wiki.mozilla.org/ReleaseEngineering/How_To/Loan_a_Slave#Windows_7_AWS_machines
I encountered a problem when I tried to find an IP address: "-bash: cloud-tools/scripts/free_ips.py: Permission denied"
@Zsolt, I will assign this task to you. Please continue to figure out how to loan this machine. Thank you.
Flags: needinfo?(zfay)
Flags: needinfo?(riman)
Flags: needinfo?(jlund)
Flags: needinfo?(ciduty)
(In reply to Radu Iman[:riman] from comment #12)
> I have tried to provide a gecko-t-win7-32-gpu loaner following the steps
> from
> https://wiki.mozilla.org/ReleaseEngineering/How_To/Loan_a_Slave#Windows_7_AWS_machines

I believe this is from Buildbot. We should figure out where these docs still apply and whether we should remove them entirely.


(In reply to Carl Corcoran [:ccorcoran] from comment #11)
> Is it not possible to request a loaner that will live for a fixed amount of
> time? 30 minutes is very restrictive. I need to know my work will not be
> lost if I have a meeting or lunch break.
> 

Taskcluster-based AWS instances are provisioned differently than they were in Buildbot. I don't believe we have a process for creating a *new* instance that is based on production but long-lived. Currently we take an already-existing instance out of the pool, but, as it is spot-based and subject to the same efficiency constraints as the rest of production, it will terminate if inactive. Rob (from relops) can confirm.

rob, when you get the opportunity, can you work with whoever is on shift in #ci and figure out what our options are for loaning given the ask in comment 11 and our out of date docs from comment 12?

re: comment 6, I want ciduty to make an exhaustive list of docs, separated by worker type, so that both they and devs know what is and isn't possible. The end goal is to not involve the relops or taskcluster teams at all, disruption-free.
Flags: needinfo?(jlund) → needinfo?(rthijssen)
(In reply to Radu Iman[:riman] from comment #12)   
> I have encountered a problem when I tried to find an IP address.  " -bash:
> cloud-tools/scripts/free_ips.py: Permission denied "
> @Zsolt I will assign this task to you. Please continue to figure out how to
> loan this machine. Thank you.

As Jake mentioned, free_ips.py uses invtool, a tool from the decommissioned Inventory service, so that won't work. If the rest of the steps work and apply to the same workers Carl wants to loan from production (gecko-t-win7-32-gpu and gecko-t-win7-32), we can skip the free_ips.py step (and update the docs) and instead point devs to the public IP address of the worker rather than a custom DNS record.
(In reply to Jordan Lund (:jlund) from comment #13)
> rob, when you get the opportunity, can you work with whoever is on shift in
> #ci and figure out what our options are for loaning given the ask in comment
> 11 and our out of date docs from comment 12?
> 
> re: from comment 6, I want ciduty to make an exhaustive list of docs,
> separated by worker type so that they and devs know what is and isn't
> possible. End goal is to not have relops or taskcluster teams involved at
> all and disrupt free.

s/rob/jake/ given timezones and since Jake has recent context.
Flags: needinfo?(rthijssen) → needinfo?(jwatkins)
Flags: needinfo?(zfay)
Generic Worker took ownership of the loaner creation process, so pmoore is better placed to answer questions about its mechanics. We had some workarounds in the old OCC process for keeping machines longer (killing the HaltOnIdle process and granting admin access), but I don't think they will work now, and I'm not sure how to do this under the new setup. I'm also not sure whether keeping the RDP session alive will keep the machine alive; it used to, but again, I'm not familiar with the new mechanics.

I can envisage that, for special use cases, we could spin up a dedicated (non-spot) instance using the same AMIs as are used in production. That takes a bit of coordination between the user and someone with access to the TC EC2 account and familiarity with EC2/AMI instantiation, disabling generic-worker, setting credentials, etc. But let's first see if the new loaner mechanism can meet this user's needs, as it's a fair bit of work to work around it.
Flags: needinfo?(jwatkins) → needinfo?(pmoore)
(In reply to Carl Corcoran [:ccorcoran] from comment #9)
> Update: My attempt to self-provision today did not work as it did yesterday.
> 
> https://tools.taskcluster.net/groups/ZWKmfiu1R02IwDFDa6erJA/tasks/ZWKmfiu1R02IwDFDa6erJA/runs/0/artifacts
> 
> The IP address in the rdpinfo.txt is not accessible. Is this a matter of VPN
> (which I am currently in the process of getting set up)?

You shouldn't need VPN access for this - it could be a problem with RDP access for worker type gecko-t-win7-32-gpu - was your previous successful loaner also for gecko-t-win7-32-gpu? I'm wondering if there is a problem with the security groups of this worker type.
 

> I would be quite happy at this point to get a "real" loaner that will
> survive longer than a few hours as well, as in the bug description. Possible?

Is 12 hours enough? That is how long they currently live for. If not, the wiki page from comment 3 explains how to get admin access, and using this, you can kill the generic-worker process, in which case your instance will live for up to 96 hours (although spot instances can be terminated by AWS at any time, unfortunately).
Flags: needinfo?(pmoore)
(In reply to Radu Iman[:riman] from comment #10)
> Hello Carl,
> 
> To my knowledge, the TaskCluster Windows machines are self-served. As those
> servers expire after 30 minutes of idle/inactivity time we can't reliably
> provide them for you.

This is incorrect - please see the wiki page from comment 3.
(In reply to Radu Iman[:riman] from comment #12)
> I have tried to provide a gecko-t-win7-32-gpu loaner following the steps
> from
> https://wiki.mozilla.org/ReleaseEngineering/How_To/Loan_a_Slave#Windows_7_AWS_machines
> I have encountered a problem when I tried to find an IP address.  " -bash:
> cloud-tools/scripts/free_ips.py: Permission denied "
> @Zsolt I will assign this task to you. Please continue to figure out how to
> loan this machine. Thank you.

This is the wrong wiki page - see comment 3.
I had indeed been previously successful in provisioning the loaner. Here's a full account of my experience in the last week of self-provisioning:
1. Followed the directions and successfully connected to a gecko-t-win7-32 machine, but when I ran "Z:\task_1527672877\command_000000_wrapper.bat", I got errors (https://pastebin.mozilla.org/9086926). As I tried to understand why my task wouldn't run, I lost the connection to the machine. I never did understand why I got that error.
2. Tried the same instructions again but the task failed to parse. I assume I must have made a typo.
3. Tried again, this time on a GPU machine. rdpinfo.txt was generated but I was unable to connect to the IP address.

I'll try again today. 12 hours should be enough. I'll report back here how it goes.
Results today:
I successfully self-provisioned 3 machines via these respective tasks:
- https://tools.taskcluster.net/groups/FcHc-GamTfOBhIKR44gBHQ/tasks/FcHc-GamTfOBhIKR44gBHQ/runs/0/artifacts
- https://tools.taskcluster.net/groups/JeUUi2SORhCLlhbiaLMsnQ/tasks/JeUUi2SORhCLlhbiaLMsnQ/runs/0/artifacts
- https://tools.taskcluster.net/groups/QRYVPx8hREuyC6fL3q5Ncw/tasks/QRYVPx8hREuyC6fL3q5Ncw/runs/0/artifacts

No connection issues; however, I'm unable to run my tasks on them.
Steps I'm taking: 
1. open up cmd.exe (I made certain the original task run had completed)
2. cd /d Z:\task_1528118652\
3. run command_000000_wrapper.bat

The resulting error log contains the following exception:

> 14:00:41    ERROR - Return code: 1
> 14:00:41    ERROR - 1 not in success codes: [0]
> 14:00:41    FATAL - Uncaught exception: Traceback (most recent call last):
> 14:00:41    FATAL -   File "Z:\task_1528118652\mozharness\mozharness\base\script.py", line 2080, in run
> 14:00:41    FATAL -     self.run_action(action)
> 14:00:41    FATAL -   File "Z:\task_1528118652\mozharness\mozharness\base\script.py", line 2018, in run_action
> 14:00:41    FATAL -     self._possibly_run_method("preflight_%s" % method_name)
> 14:00:41    FATAL -   File "Z:\task_1528118652\mozharness\mozharness\base\script.py", line 1959, in _possibly_run_method
> 14:00:41    FATAL -     return getattr(self, method_name)()
> 14:00:41    FATAL -   File "Z:\task_1528118652\mozharness\mozharness\mozilla\testing\testbase.py", line 733, in preflight_run_tests
> 14:00:41    FATAL -     self._run_cmd_checks(c.get('preflight_run_cmd_suites', []))
> 14:00:41    FATAL -   File "Z:\task_1528118652\mozharness\mozharness\mozilla\testing\testbase.py", line 727, in _run_cmd_checks
> 14:00:41    FATAL -     fatal_exit_code=suite.get('fatal_exit_code', 3))
> 14:00:41    FATAL -   File "Z:\task_1528118652\mozharness\mozharness\base\script.py", line 1465, in run_command
> 14:00:41    FATAL -     self.fatal("Halting on failure while running %s" % command,
> 14:00:41    FATAL - TypeError: not all arguments converted during string formatting
> 14:00:41    FATAL - Running post_fatal callback...
> 14:00:41    FATAL - Exiting -1
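As an aside, the final TypeError in that traceback is a logging bug rather than the root failure: mozharness %-formats the fatal message against the failing command, and one way this exact TypeError arises is when that command is a tuple of arguments, which Python unpacks into the single %s. A minimal sketch (not the actual mozharness code; the command tuple below is hypothetical):

```python
# Minimal sketch (NOT the real mozharness code) of how this TypeError can
# arise: %-formatting a string against a tuple unpacks the tuple, so a
# command stored as a tuple of arguments supplies too many values for the
# single %s placeholder.
command = ("python", "mouse_and_screen_resolution.py",
           "--configuration-file", "machine-configuration.json")

try:
    message = "Halting on failure while running %s" % command
except TypeError as exc:
    print(exc)  # not all arguments converted during string formatting

# Wrapping the value in a one-element tuple avoids the unpacking:
print("Halting on failure while running %s" % (command,))
```

The practical upshot is that the TypeError masks the message mozharness was trying to log, so the real cause of the failure has to be dug out of log_info.log instead.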

Here's a task I created in the same way, but this time I didn't run the command_000000_wrapper.bat manually.

- https://tools.taskcluster.net/groups/Xs00GOpZSluO-urR-4BNpQ/tasks/Xs00GOpZSluO-urR-4BNpQ

Here, the task runs as expected. I suspect that if I now run command_000000_wrapper.bat on that machine, I'll get the exception above instead of the task running.

Am I missing another step before running command_000000_wrapper?
Flags: needinfo?(pmoore)
Looking into this now...
Flags: needinfo?(pmoore)
I get the same error if I don't connect with a screen resolution of 1280x1024:

https://taskcluster-artifacts.net/Eij_hZaDSaqCr4CSFXbjQw/0/public/logs/log_error.log

If I open the info log instead:

https://taskcluster-artifacts.net/Eij_hZaDSaqCr4CSFXbjQw/0/public/logs/log_info.log

then I see the following at the end:

09:46:11     INFO - Copy/paste: c:\mozilla-build\python\python.exe Z:\task_1530003585\mozharness\external_tools\mouse_and_screen_resolution.py --configuration-file Z:\task_1530003585\mozharness\external_tools\machine-configuration.json
09:46:14     INFO -  Screen resolution (current): (2560, 1440)
09:46:14     INFO -  Changing the screen resolution...
09:46:14     INFO -  Screen resolution (new): (2560, 1440)
09:46:14     INFO -  Mouse position (current): (1445, 825)
09:46:14     INFO -  Mouse position (new): (1010, 10)
09:46:14     INFO -  INFRA-ERROR: The new screen resolution or mouse positions are not what we expected
09:46:14    ERROR - Return code: 1
09:46:14    ERROR - 1 not in success codes: [0]
09:46:14  WARNING - setting return code to 3
09:46:14     INFO - Running post-action listener: _package_coverage_data
09:46:14     INFO - Running post-action listener: _resource_record_post_action
09:46:14     INFO - [mozharness: 2018-06-26 09:46:14.985000Z] Finished run-tests step (failed)
09:46:14    FATAL - Uncaught exception: Traceback (most recent call last):
09:46:14    FATAL -   File "Z:\task_1530003585\mozharness\mozharness\base\script.py", line 2080, in run
09:46:14    FATAL -     self.run_action(action)
09:46:14    FATAL -   File "Z:\task_1530003585\mozharness\mozharness\base\script.py", line 2018, in run_action
09:46:14    FATAL -     self._possibly_run_method("preflight_%s" % method_name)
09:46:14    FATAL -   File "Z:\task_1530003585\mozharness\mozharness\base\script.py", line 1959, in _possibly_run_method
09:46:14    FATAL -     return getattr(self, method_name)()
09:46:14    FATAL -   File "Z:\task_1530003585\mozharness\mozharness\mozilla\testing\testbase.py", line 736, in preflight_run_tests
09:46:14    FATAL -     self._run_cmd_checks(c.get('preflight_run_cmd_suites', []))
09:46:14    FATAL -   File "Z:\task_1530003585\mozharness\mozharness\mozilla\testing\testbase.py", line 730, in _run_cmd_checks
09:46:14    FATAL -     fatal_exit_code=suite.get('fatal_exit_code', 3))
09:46:14    FATAL -   File "Z:\task_1530003585\mozharness\mozharness\base\script.py", line 1465, in run_command
09:46:14    FATAL -     self.fatal("Halting on failure while running %s" % command,
09:46:14    FATAL - TypeError: not all arguments converted during string formatting
09:46:14    FATAL - Running post_fatal callback...
09:46:14    FATAL - Exiting -1
09:46:14     INFO - Running post-run listener: _resource_record_post_run
09:46:15     INFO - Total resource usage - Wall time: 5s; CPU: 6.0%; Read bytes: 0; Write bytes: 145659904; Read time: 0; Write time: 2
09:46:15     INFO - TinderboxPrint: CPU usage<br/>5.9%
09:46:15     INFO - TinderboxPrint: I/O read bytes / time<br/>0 / 0
09:46:15     INFO - TinderboxPrint: I/O write bytes / time<br/>145,659,904 / 2
09:46:15     INFO - TinderboxPrint: CPU idle<br/>38.0 (94.1%)
09:46:15     INFO - TinderboxPrint: CPU system<br/>0.7 (1.8%)
09:46:15     INFO - TinderboxPrint: CPU user<br/>1.6 (4.1%)
09:46:15     INFO - pull - Wall time: 0s; CPU: Can't collect data; Read bytes: 0; Write bytes: 0; Read time: 0; Write time: 0
09:46:15     INFO - install - Wall time: 3s; CPU: 9.0%; Read bytes: 0; Write bytes: 77539840; Read time: 0; Write time: 1
09:46:15     INFO - run-tests - Wall time: 3s; CPU: 1.0%; Read bytes: 0; Write bytes: 0; Read time: 0; Write time: 0
09:46:15     INFO - Running post-run listener: copy_logs_to_upload_dir
09:46:15     INFO - Copying logs to upload dir...
09:46:15     INFO - mkdir: Z:\task_1530003585\build\upload\logs




Do you also see the following line in your log_info.log?

INFO -  INFRA-ERROR: The new screen resolution or mouse positions are not what we expected


See point 8) of https://wiki.mozilla.org/ReleaseEngineering/How_To/Self_Provision_a_TaskCluster_Windows_Instance#For_generic-worker_10.5.0_onwards:


> 8) Connect with screen resolution 1280x1024 ! Note, it is important to use this resolution for gecko tests, since this is the screen size used by the tests, and the screen size cannot change once you have made a connection. 
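To make the preflight check concrete, here's a hypothetical sketch of what it verifies (the real script is mozharness/external_tools/mouse_and_screen_resolution.py, which reads its expected values from machine-configuration.json): the session's resolution must already match what the gecko tests expect, since it cannot change once the RDP connection is made.

```python
# Hypothetical sketch of the preflight resolution check; NOT the real
# mouse_and_screen_resolution.py. The expected value comes from point 8
# of the wiki page quoted above.
EXPECTED_RESOLUTION = (1280, 1024)

def preflight_resolution_check(current):
    """Return (ok, message) for the current (width, height) resolution."""
    if current != EXPECTED_RESOLUTION:
        return False, ("INFRA-ERROR: The new screen resolution or mouse "
                       "positions are not what we expected")
    return True, "resolution ok"

print(preflight_resolution_check((2560, 1440))[0])  # the failing session above -> False
print(preflight_resolution_check((1280, 1024))[0])  # a correctly sized RDP session -> True
```

This is why connecting at the native resolution of a large monitor (2560x1440 in the log above) makes the preflight fail before any test runs.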


If the problem isn't this, please let me know. Thanks!
Flags: needinfo?(ccorcoran)
It looks like you're right! Even though this point is emphasized in the docs, I still didn't connect it to the errors. Next time I will definitely look closer at the INFO log.

This is resolved for me.
Flags: needinfo?(ccorcoran)
No worries, glad it is working for you! :-)
Status: REOPENED → RESOLVED
Closed: 6 years ago
Resolution: --- → INVALID
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard