Closed Bug 1595279 Opened 5 years ago Closed 3 years ago

No gecko-t-win64-aarch64-laptop workers are active and taking jobs, big backlog for Windows AArch laptops

Categories

(Infrastructure & Operations :: RelOps: Windows OS, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: aryx, Unassigned)

Attachments

(5 files)

Less than 5 out of 25 gecko-t-win64-aarch64-laptop workers are active and taking jobs, big backlog for Windows AArch laptops.

https://earthangel-b40313e5.influxcloud.net/d/slXwf4emz/workers?orgId=1&var-workerType=gecko-t-win64-aarch64-laptop&from=now-24h&to=now&refresh=1m

There was some work on those yesterday. Rob, can you provide any details about that and what needs to be done to get the worker pool back to normal?

Flags: needinfo?(rthijssen)

No Windows 10 AArch64 laptops are running. https://firefox-ci-tc.services.mozilla.com/provisioners/bitbar/worker-types/gecko-t-win64-aarch64-laptop lists only 2 machines, both of which took their last job 13 hours ago.

Flags: needinfo?(wcosta)
Summary: Less than 5 out of 25 gecko-t-win64-aarch64-laptop workers are active and taking jobs, big backlog for Windows AArch laptops → No gecko-t-win64-aarch64-laptop workers are active and taking jobs, big backlog for Windows AArch laptops

i'm currently working on this with bitbar and pmoore.

there are/were many issues:

  • most machines were dead and not logging to papertrail. this has been partially resolved by bitbar who have rebooted many instances. the running list of instances that are no longer dead and are actively logging to papertrail is at: https://github.com/mozilla-releng/OpenCloudConfig/tree/master/keys
  • all machines were again running outdated generic worker versions. most were on the same variant of v14 but a few were running different versions. this has now been corrected for instances that are awake, running occ and logging to papertrail. those instances are now on gw 16.5.1
  • all machines had different and confusing gw configurations. many had config files that had been modified badly by a powershell script which had added a whole bunch of object type information when it had tried to edit config settings like workerId, publicIP and other instance specific settings. this has been resolved by causing the instances to generate an rsa key and then publish their public keys. the public keys have been added to occ and used to generate and publish encrypted gw configurations that are specific to each worker. occ then checks the gw config on each boot and overwrites it with the corrected/encrypted config.
  • bitbar seem to have created a user called "testdroid" on each of these laptops and each instance started throwing errors when generic worker ran to say that the running testdroid session conflicted with the task user session. this caused generic-worker to panic and reboot. i rectified this by causing occ to disable the testdroid account and change its credentials to prevent it from logging in on the next reboot.
  • the latest error we see is from generic worker and reads:
 *********** PANIC occurred! ***********  
WTSQueryUserToken: An attempt was made to reference a token that does not exist. 
Exiting worker with exit code 69 

pmoore and i are currently debugging this to see if we can get past it and get the machines to take tasks.

Assignee: nobody → rthijssen
Status: NEW → ASSIGNED
Flags: needinfo?(rthijssen)

we haven't yet found a resolution to get the yogas taking tasks. the machines are mostly up and running (with the exception of 002, 004, 005, 008, 013, 019, 028, 030, 031, 032, 035, which i have asked bitbar to reboot).

the working machines' logs can be seen at: https://my.papertrailapp.com/events?q=system%3At-lenovoyogac630
the patches described in comment 2 are triggered here: https://github.com/mozilla-releng/OpenCloudConfig/blob/master/userdata/MaintainSystem.ps1#L146-L164
maintain system runs at each instance boot and invokes the patches in the gist. for instances that have already been patched, they don't do much more than log their state. when the instances in the first line of this comment come online, they will begin to patch themselves as well. however, they won't be fully operational until their public keys have been added to the encryption list, allowing them to collect their patched and encrypted gw configs. i will monitor for these instances coming to life when bitbar reboots them and add their keys.

the elephant in the room is that even once they are patched, they won't take tasks until we understand what is causing the generic-worker panic.

my suspicion is that arm64 windows instances have installed a windows update which prevents the gw task user creation routine from running as intended, the way it used to. if this suspicion is correct, the only solutions i can think of are to somehow roll back the windows update that caused the problem or to patch generic-worker to use a different mechanism when creating task users. of course, we first need to establish what is causing the problem and if it was actually a windows update.

two more yogas came online this morning (028 and 035).
that leaves 9 machines still down: 002, 004, 005, 008, 013, 019, 030, 031, 032

i added some debugging log messages to understand the windows version and patch levels and i can see that most of these machines are on different patch levels and some even have completely different windows builds. this leads me to believe that a windows update was not responsible for the bustage since all yogas are affected but many yogas do not have a recent windows update.

  • t-lenovoyogac630-003: 10.0.18362 (bios v2.06 6/4/2019)
    • kb4508433 9/10/2019
    • kb4515383 9/11/2019
    • kb4516115 9/11/2019
    • kb4520390 10/4/2019
    • kb4521863 10/9/2019
    • kb4517389 10/9/2019
  • t-lenovoyogac630-006: 10.0.17134 (bios v1.06 10/25/2018)
    • kb4485449 7/13/2019
    • kb4493478 7/13/2019
    • kb4497398 7/13/2019
    • kb4497932 7/13/2019
    • kb4503308 7/13/2019
    • kb4509094 7/13/2019
    • kb4512501 8/13/2019
  • t-lenovoyogac630-007: 10.0.17134 (bios v1.06 10/25/2018)
    • kb4485449 8/28/2019
    • kb4497398 8/28/2019
    • kb4497932 8/28/2019
    • kb4503308 8/28/2019
    • kb4509094 8/28/2019
    • kb4507435 8/28/2019
  • t-lenovoyogac630-010: 10.0.17134 (bios v1.06 10/25/2018)
    • kb4456655 3/18/2019
    • kb4485449 3/25/2019
    • kb4497398 5/17/2019
    • kb4503308 6/13/2019
    • kb4509094 7/10/2019
    • kb4512501 8/13/2019
  • t-lenovoyogac630-012: 10.0.18362 (bios v2.06 6/4/2019)
    • kb4516115 10/31/2019
    • kb4521863 10/31/2019
    • kb4517389 10/31/2019
  • t-lenovoyogac630-015: 10.0.17134 (bios v1.06 10/25/2018)
    • kb4456655 3/18/2019
    • kb4485449 3/25/2019
    • kb4497398 5/14/2019
    • kb4503308 6/12/2019
    • kb4509094 7/11/2019
    • kb4512501 8/14/2019
  • t-lenovoyogac630-018: 10.0.17134 (bios v1.06 10/25/2018)
    • kb4456655 3/19/2019
    • kb4485449 3/25/2019
    • kb4497398 5/15/2019
    • kb4503308 6/11/2019
    • kb4509094 7/10/2019
    • kb4512501 8/14/2019
  • t-lenovoyogac630-022: 10.0.17134 (bios v1.06 10/25/2018)
    • kb4456655 3/18/2019
    • kb4485449 3/25/2019
    • kb4497398 5/15/2019
    • kb4503308 6/13/2019
    • kb4509094 7/10/2019
    • kb4512501 8/13/2019
  • t-lenovoyogac630-024: 10.0.17134 (bios v1.06 10/25/2018)
    • kb4456655 7/13/2019
    • kb4485449 7/13/2019
    • kb4497398 7/13/2019
    • kb4503308 7/13/2019
    • kb4509094 7/13/2019
    • kb4512501 8/14/2019
  • t-lenovoyogac630-025: 10.0.17134 (bios v1.06 10/25/2018)
    • kb4456655 7/13/2019
    • kb4485449 7/13/2019
    • kb4497398 7/13/2019
    • kb4503308 7/13/2019
    • kb4509094 7/13/2019
    • kb4516115 10/14/2019
    • kb4512501 8/13/2019
  • t-lenovoyogac630-026: 10.0.17134 (bios v1.06 10/25/2018)
    • kb4456655 7/11/2019
    • kb4485449 7/11/2019
    • kb4493478 7/11/2019
    • kb4497398 7/11/2019
    • kb4497932 7/11/2019
    • kb4503308 7/11/2019
    • kb4509094 7/11/2019
    • kb4512501 8/14/2019
  • t-lenovoyogac630-029: 10.0.17134 (bios v1.06 10/25/2018)
    • kb4456655 3/19/2019
    • kb4485449 3/25/2019
    • kb4493478 4/10/2019
    • kb4497398 5/15/2019
    • kb4503308 6/13/2019
    • kb4509094 7/10/2019
    • kb4512501 8/14/2019
  • t-lenovoyogac630-033: 10.0.17134 (bios v1.06 10/25/2018)
    • kb4456655 7/14/2019
    • kb4485449 7/14/2019
    • kb4497398 7/14/2019
    • kb4503308 7/14/2019
    • kb4509094 7/14/2019
    • kb4512501 8/14/2019
  • t-lenovoyogac630-035: 10.0.17134 (bios v1.06 10/25/2018)
    • kb4456655 2/21/2019
    • kb4485449 5/13/2019
    • kb4497398 5/17/2019
    • kb4503308 6/12/2019
    • kb4509094 7/10/2019
    • kb4512501 8/13/2019
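The reasoning behind ruling out a Windows update can be checked mechanically: if one update had broken every machine, it would have to be installed on all of them. The following is a minimal sketch, not part of OCC or generic-worker, that intersects the installed KB lists. The function name `commonUpdates` is made up for illustration, and only three of the machines listed above are included.

```go
package main

import (
	"fmt"
	"sort"
)

// commonUpdates returns the KB identifiers that are installed on every
// machine in the map, sorted alphabetically. An empty result means that
// no single update is shared by all machines.
func commonUpdates(installed map[string][]string) []string {
	counts := make(map[string]int)
	for _, kbs := range installed {
		seen := make(map[string]bool) // ignore duplicates within one machine
		for _, kb := range kbs {
			if !seen[kb] {
				seen[kb] = true
				counts[kb]++
			}
		}
	}
	common := []string{}
	for kb, n := range counts {
		if n == len(installed) {
			common = append(common, kb)
		}
	}
	sort.Strings(common)
	return common
}

func main() {
	// Three machines copied from the list above (003, 006 and 012).
	installed := map[string][]string{
		"t-lenovoyogac630-003": {"kb4508433", "kb4515383", "kb4516115", "kb4520390", "kb4521863", "kb4517389"},
		"t-lenovoyogac630-006": {"kb4485449", "kb4493478", "kb4497398", "kb4497932", "kb4503308", "kb4509094", "kb4512501"},
		"t-lenovoyogac630-012": {"kb4516115", "kb4521863", "kb4517389"},
	}
	fmt.Println("updates shared by all machines:", len(commonUpdates(installed)))
}
```

Machines 003 and 006 already have no update in common, so the intersection is empty, which is consistent with the conclusion that a single Windows update is unlikely to be the cause.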
Flags: needinfo?(wcosta)

i also tried rolling back generic worker to the last known good version (14.1.2) on these instances but the result was the same. generic worker panics with the same error messages.

i have now seen that most of the windows 10 hardware is busted with a similar generic-worker panic.

the error message on non-yogas about the interactive session is the same as we got on yogas before we disabled the testdroid user account.

pete, we're going to need your help to debug this as the panic occurs in gw and i think i have exhausted debug options around operating system updates or changes.

Flags: needinfo?(pmoore)

There is https://bugzilla.mozilla.org/show_bug.cgi?id=1547965, which had a similar error with generic-worker 14.1.0:

May 13 12:30:40 T-W1064-MS-016.mdc1.mozilla.com generic-worker: Making system call WTSGetActiveConsoleSessionId with args: []
May 13 12:30:40 T-W1064-MS-016.mdc1.mozilla.com generic-worker: Result: 1 0 The operation completed successfully.
May 13 12:30:40 T-W1064-MS-016.mdc1.mozilla.com generic-worker: Making system call WTSQueryUserToken with args: [1 C0420786D0]
May 13 12:30:40 T-W1064-MS-016.mdc1.mozilla.com generic-worker: Result: 0 0 An attempt was made to reference a token that does not exist.

The issue was that the deployment was touching registry settings in HKEY_LOCAL_MACHINE\Software\Microsoft\Windows NT\CurrentVersion\Winlogon\. This was odd because it worked previously without issue. (solved in https://bugzilla.mozilla.org/show_bug.cgi?id=1547965#c10 but not the same here).

From Bitbar Slack:

Stanley Lao 12:22 PM
The laptops are setup to auto login through netplwiz with the user Testdroid. The laptops were running fine before the switch(boot into task user). I do not know why they are not booting into the task users. (edited)
Mark Cornmesser 12:25 PM
Can we remove it from a yoga and see what happens?
And when we say auto-login,is that OS wise?
Stanley Lao 12:26 PM
Testdroid user is the only admin account on the devices.
Yes. OS wise.

Screenshot from Bitbar. It does appear that testdroid is touching values here.

I had Bitbar remove the testdroid user off of 032 and the Winlogon registry key. In both cases the nodes stopped outputting to papertrail.

It is also worth noting that Bitbar will RDP into some of the laptops. I explained to them that VNC may be the better method for remote access. That could also explain some other oddities we may see from the testdroid user.

I don't have an answer as to when these will be back online. From what I have seen today I think there may be some contention between registry settings that generic-worker needs to set and what is set by Bitbar using netplwiz. I don't know why this is an issue now. I am also unsure what the path forward would be.

I am going to move the ni over to grenade.

Flags: needinfo?(mcornmesser) → needinfo?(rthijssen)

:bc, could you disable these from running by default on m-c/try (try only with --full)? This way our queues don't build up and we can fix this and turn these back on when ready.

Flags: needinfo?(bob)

The requirements of generic-worker are that it can set the WinLogon registry settings to enable autologon as a task user.

It sets the following registry keys:

  • HKLM\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Winlogon\AutoAdminLogon
  • HKLM\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Winlogon\DefaultUserName
  • HKLM\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Winlogon\DefaultPassword

If this is not enough to cause the new task user to automatically logon after a reboot, then generic-worker will not function correctly.
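The requirement above can be expressed as a check over a snapshot of the three Winlogon values. This is an illustrative sketch only, not generic-worker code: the function `autologonProblems`, the task user name `task_example` and the snapshot contents are hypothetical. It relies on the standard Windows convention that `AutoAdminLogon` must hold the string "1" for automatic logon to be enabled.

```go
package main

import (
	"fmt"
	"strings"
)

// autologonProblems compares a snapshot of Winlogon values with what is
// needed for taskUser to be logged on automatically after a reboot, and
// returns a description of each mismatch (none means the snapshot looks right).
func autologonProblems(values map[string]string, taskUser string) []string {
	var problems []string
	if values["AutoAdminLogon"] != "1" {
		problems = append(problems, `AutoAdminLogon is not "1"`)
	}
	// Windows account names are case-insensitive, so compare accordingly.
	if got := values["DefaultUserName"]; !strings.EqualFold(got, taskUser) {
		problems = append(problems, fmt.Sprintf("DefaultUserName is %q, want %q", got, taskUser))
	}
	if values["DefaultPassword"] == "" {
		problems = append(problems, "DefaultPassword is empty")
	}
	return problems
}

func main() {
	// Hypothetical snapshot in which autologon still points at testdroid.
	snapshot := map[string]string{
		"AutoAdminLogon":  "1",
		"DefaultUserName": "testdroid",
		"DefaultPassword": "not-a-real-password",
	}
	for _, p := range autologonProblems(snapshot, "task_example") {
		fmt.Println(p)
	}
}
```

A check like this only says what the values read back as. It cannot tell which registry view those values were read from.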

Flags: needinfo?(pmoore)
Flags: needinfo?(bob)
Pushed by bclary@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/b582b54b01d4
disable windows10-aarch64 on mozilla-central and restrict try to --full, r=jmaher.
See Also: → 1594403
Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED

i've reopened since we still need to get these machines taking tasks.

this morning stanley at bitbar managed to get a yoga to take a task by manually setting the winlogon registry keys to their expected task user settings and rebooting. the instance successfully completed a task and then reverted to the old panic behaviour on its next task run iteration where the registry was not manually updated.

this prompted pete and i to add some debugging showing the settings for current/next task user and the default username in the registry.

we observed that my home edition of windows 10 on yoga does not include a "remote desktop users" group which is something gw requires. when i added that group i started seeing the token panics in gw.

pete is working with this information to reproduce the token error with a minimal go program that creates users in the way that gw does. this should allow us to isolate the problem we see on this architecture.

Status: RESOLVED → REOPENED
Flags: needinfo?(rthijssen)
Resolution: FIXED → ---

we've managed to get tasks running by using a slightly older generic worker (14.1.2) in combination with a patch that resets gw (delete json files and reg keys) whenever a panic is detected.
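How the OCC patch detects a panic is not shown in this bug. As a rough illustration only, the sketch below recognises the panic output quoted earlier in the bug by looking for the banner line and then inspecting the line that follows it. The function names `panicReason` and `needsReset` are hypothetical, and the actual reset (deleting json files and registry keys) is deliberately left out.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// panicReason scans worker log output and returns the first non-empty line
// after the "PANIC occurred!" banner, or "" if no panic was logged.
func panicReason(log string) string {
	scanner := bufio.NewScanner(strings.NewReader(log))
	sawBanner := false
	for scanner.Scan() {
		line := strings.TrimSpace(scanner.Text())
		if sawBanner && line != "" {
			return line
		}
		if strings.Contains(line, "PANIC occurred!") {
			sawBanner = true
		}
	}
	return ""
}

// needsReset reports whether the log contains the token panic seen in this bug.
func needsReset(log string) bool {
	return strings.HasPrefix(panicReason(log), "WTSQueryUserToken:")
}

func main() {
	log := " *********** PANIC occurred! ***********  \n" +
		"WTSQueryUserToken: An attempt was made to reference a token that does not exist. \n" +
		"Exiting worker with exit code 69 \n"
	fmt.Println("reset needed:", needsReset(log))
}
```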

pete is still working to reproduce the token issue with minimal go code in order to patch a newer release of gw.

I've managed to reproduce the issue on a yoga laptop here outside of OCC just in generic-worker.

The calls to update the winlogon registry keys are successful, yet for some reason the registry does not seem to get updated.

I've reproduced this with generic-worker 16.5.5 and am now checking that 14.1.2 doesn't have this problem.

Assuming that is the case, I will be installing go directly on the laptop and rebuilding both 14.1.2 and 16.5.5 with extra debugging, to try to work out why one works and the other doesn't. It is quite bizarre, because the releases were built with identical go versions (go 1.10.8) and the code that modifies registry hasn't changed in around 4-5 years, and also the problem only occurs on aarch64 machines, not on any other windows workers we have that happily run generic-worker 16.

I'll report back when I have more results.

there are currently 11 bitbar yoga instances online, taking tasks and logging to papertrail. which means there are 24 instances that have not started taking tasks since we figured out how to work around the current issues.

instances that are not sending logs to papertrail, have to be rebooted by bitbar. if that doesn't get them logging to papertrail, then the occ trigger command needs to be run by bitbar from an admin powershell prompt.

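# branch or ref of OpenCloudConfig to run
$gitBranchOrRef = 'master'
# download rundsc.ps1 for that ref and execute it (the random guid in the query string presumably avoids a cached copy)
Invoke-Expression (New-Object Net.WebClient).DownloadString(('https://raw.githubusercontent.com/mozilla-releng/OpenCloudConfig/{0}/userdata/rundsc.ps1?{1}' -f $gitBranchOrRef, [Guid]::NewGuid()))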

then we have to get the instance public key into occ at https://github.com/mozilla-releng/OpenCloudConfig/tree/master/keys. only instances that have been remastered or don't already have a public key in occ need to have this done.
we get the public key from the instance logs at https://my.papertrailapp.com/events?q=system%3Ayoga%20OccReset%20PGP
once the key is in occ we have to generate encrypted generic worker configs for the instance with https://github.com/mozilla-releng/OpenCloudConfig/blob/master/ci/generate-encrypted-config.sh and push the gpg files to github occ at: https://github.com/mozilla-releng/OpenCloudConfig/tree/master/cfg/generic-worker

when all that has been done, the instance should start taking tasks and show up at: https://firefox-ci-tc.services.mozilla.com/provisioners/bitbar/worker-types/gecko-t-win64-aarch64-laptop

I've managed to reproduce the issue on an aarch64 windows laptop here, and discovered some interesting things:

  1. Like Rob says, win10 on aarch64 seems not to have a Remote Desktop Users group, but that can be manually created

  2. The auto logon is not working, also for older versions of generic-worker, like 14.1.2

  3. The reasons tasks run under generic-worker 14.1.2, despite the automatic login not working, are that:
    a) generic-worker 14.1.2 does not check that the logon session of the interactive user is for the task user it created - it just waits until the interactive desktop session is active, and assumes the user is the correct one, unlike generic-worker 16.5.5 which asserts that the user is the correct one.
    b) Presumably, the testdroid user that was set up on the bitbar laptops was set up so that there was an interactive desktop session available for running tasks

    To test this theory, I submitted a task, which confirmed that indeed all tasks are running as the testdroid OS user rather than the task user they should use (see log lines 22/23).

  4. After adding debug to the generic-worker logs, I can see that generic-worker successfully writes winlogon registry keys, and after rebooting and reading the registry keys back, it finds the same values it wrote to the registry before the reboot. However, for some reason, when I look in the registry with regedit, or using reg query on the command line, the keys that generic-worker wrote appear not to be in the registry. It is like generic-worker is writing them to a shadow registry file, not the one that the system is using. I have no explanation for this at the moment, since there should only be a single HKEY_LOCAL_MACHINE registry, and I run regedit/reg query from the same command shell as I run the go program that reads/writes to HKEY_LOCAL_MACHINE, yet they show different results. I intend to troubleshoot this further by patching the go standard library with additional debug output to log all the syscalls it makes.

In summary, generic-worker 14.1.2 runs, but isn't operating correctly since it runs all tasks as the same OS user (testdroid). Version 16 doesn't run, because it is stricter than generic-worker 14.1.2 and checks that the user it is running tasks as is the user that it created for the purpose. The reason the user isn't auto-logging in is that the updates to the winlogon registry settings from the go code that generic-worker calls appear to be operating against a different registry to the actual registry that the system is using. This only happens on our aarch64 windows infrastructure - all our amd64 and i386 windows workers, for Windows 7, Windows 10, Windows Server 2012 R2, all operate as expected, and don't seem to have the same "shadow registry" issue that the aarch64 Windows 10 workers seem to have.
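The difference between the two versions described above can be modelled as two checks on the interactive session. This is a simplified sketch of the described behaviour, not the actual generic-worker source: `lenientCheck`, `strictCheck` and the user name `task_example` are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// lenientCheck models the 14.1.2 behaviour described above: any active
// interactive desktop session is accepted, whoever owns it.
func lenientCheck(sessionUser string) error {
	if sessionUser == "" {
		return errors.New("no interactive desktop session is active")
	}
	return nil
}

// strictCheck models the 16.5.5 behaviour: the session must also belong
// to the task user that the worker created.
func strictCheck(sessionUser, taskUser string) error {
	if err := lenientCheck(sessionUser); err != nil {
		return err
	}
	if !strings.EqualFold(sessionUser, taskUser) {
		return fmt.Errorf("interactive session belongs to %q, expected task user %q", sessionUser, taskUser)
	}
	return nil
}

func main() {
	// The situation on the laptops: testdroid is logged on, not the task user.
	fmt.Println("lenient:", lenientCheck("testdroid"))
	fmt.Println("strict: ", strictCheck("testdroid", "task_example"))
}
```

With testdroid logged on, the lenient check passes and tasks run as the wrong user, while the strict check fails, which matches the behaviour reported for the two versions.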

The investigation continues...

  1. After adding debug to the generic-worker logs, I can see that generic-worker successfully writes winlogon registry keys, and after rebooting and reading the registry keys back, it finds the same values it wrote to the registry before the reboot. However, for some reason, when I look in the registry with regedit, or using reg query on the command line, the keys that generic-worker wrote appear not to be in the registry. It is like generic-worker is writing them to a shadow registry file, not the one that the system is using. I have no explanation for this at the moment, since there should only be a single HKEY_LOCAL_MACHINE registry, and I run regedit/reg query from the same command shell as I run the go program that reads/writes to HKEY_LOCAL_MACHINE, yet they show different results. I intend to troubleshoot this further by patching the go standard library with additional debug output to log all the syscalls it makes.

So typically the on-disk location of the registry is C:\Windows\System32\config. I wonder if, because of the different architecture, the Yoga laptops' on-disk registry location is C:\Windows\SysArm32\config. Both of these locations exist on the Yogas. Interestingly enough, if cmd is open and the where reg command is run, it comes back with C:\Windows\System32\reg.exe. I am guessing that the path is picking up reg in System32, and that reg command is then modifying the registry files in C:\Windows\System32 and thus having no impact on the laptops' configurations.

The solution may be calling the full path of the C:\Windows\SysArm32\reg.exe command or altering the local path to have C:\Windows\SysArm32 instead of C:\Windows\System32.

(In reply to Mark Cornmesser [:markco] from comment #22)

Thanks Mark for highlighting the two different config paths on the file system.

Indeed, I was able to see that running C:\Windows\System32\regedt32.exe shows different settings from running C:\Windows\SysArm32\regedt32.exe. However, neither had the DefaultUserName / DefaultPassword settings that the go program was setting and was able to read.

In the end I was able to search for the values in the registry and discovered that the keys had been written to:

  • HKEY_LOCAL_MACHINE\SOFTWARE\WOW6432Node\Microsoft\Windows NT\CurrentVersion\Winlogon

This led me to the registry redirector which has some special notes:

Windows 10 on ARM: In addition to the 32-bit logical view for x86 applications, Windows 10 on ARM includes a separate logical view for 32-bit ARM applications.

From digging deeper, it looks like we may need to access an alternate registry view when updating the winlogon registry keys. Since this is done in the go standard library, we will need to create our own library to do this.

An alternative is to call out to the reg system command to perform the updates.

Note, this may arguably be a bug in Windows 10 on ARM, since on 386/amd64 platforms, it seems the winlogon keys are shared and not reflected.
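The redirection found above can be illustrated with a small path mapping. This is a deliberately simplified model, assuming the writing process goes through the 32-bit registry view. Real Windows applies redirection per key and treats some keys as shared, so this should not be read as a complete description of the registry redirector. The function name `wow64View` is made up for the example.

```go
package main

import (
	"fmt"
	"strings"
)

// wow64View returns the location where a key under HKEY_LOCAL_MACHINE\SOFTWARE
// ends up when it is accessed through the 32-bit registry view and the key is
// redirected. Keys outside that subtree, or already under WOW6432Node, are
// returned unchanged.
func wow64View(key string) string {
	const prefix = `HKEY_LOCAL_MACHINE\SOFTWARE\`
	if len(key) < len(prefix) || !strings.EqualFold(key[:len(prefix)], prefix) {
		return key
	}
	rest := key[len(prefix):]
	if strings.HasPrefix(strings.ToUpper(rest), `WOW6432NODE\`) {
		return key
	}
	return prefix + `WOW6432Node\` + rest
}

func main() {
	intended := `HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Winlogon`
	fmt.Println("intended:", intended)
	fmt.Println("actual:  ", wow64View(intended))
}
```

The mapped path is the WOW6432Node location reported above, which would explain why the program could read back its own writes while regedit showed nothing under the expected Winlogon key.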

I'll see which is easier/simpler to implement.

But at least the mystery is solved. :-)

I'm not sure if this is going to break things for generic-worker on Windows 7 32 bit (386) edition, but we'll find out in testing.

If it does break things, then we still have the option of writing explicitly to the 32 bit registry view AND the 64 bit registry view, but let's try this first.
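For the fallback of writing to both views, one option is the `reg` command's `/reg:32` and `/reg:64` switches, which select the registry view explicitly. The sketch below only builds the argument lists and prints them. It is not the approach taken in the pull request, and the helper `regAddArgs` and the user name `task_example` are hypothetical. On Windows the arguments could be passed to `exec.Command("reg", args...)`.

```go
package main

import (
	"fmt"
	"strings"
)

// regAddArgs builds the arguments for "reg add" that set a REG_SZ value
// in the given registry view ("32" or "64"), overwriting without a prompt.
func regAddArgs(key, name, data, view string) []string {
	return []string{"add", key, "/v", name, "/t", "REG_SZ", "/d", data, "/f", "/reg:" + view}
}

func main() {
	const winlogon = `HKLM\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Winlogon`
	// Write the same value through both views so that either one is correct.
	for _, view := range []string{"32", "64"} {
		args := regAddArgs(winlogon, "DefaultUserName", "task_example", view)
		fmt.Println("reg", strings.Join(args, " "))
	}
}
```

The printed lines are not shell-quoted, so the key path with its space would need quoting if pasted into a command prompt. Passing the slice to `exec.Command` avoids that problem.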

Assignee: rthijssen → pmoore
Status: REOPENED → ASSIGNED
Attachment #9109012 - Flags: review?(rthijssen)
Comment on attachment 9109012 [details] [review]
GitHub Pull Request for generic-worker

lgtm
Attachment #9109012 - Flags: review?(rthijssen) → review+

Released in generic-worker 16.5.6.

Status: ASSIGNED → RESOLVED
Closed: 5 years ago → 5 years ago
Resolution: --- → FIXED

:gbrown, could you re-enable tests on win/aarch64? These were disabled a month ago in comment 15

Status: RESOLVED → REOPENED
Flags: needinfo?(gbrown)
Resolution: FIXED → ---
Flags: needinfo?(gbrown)
Attachment #9115670 - Attachment description: Bug 1595279 - Start running windows/aarch64 tests on mozilla-central again; r= → Bug 1595279 - Start running windows/aarch64 web-platform tests on mozilla-central again; r=
Pushed by gbrown@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/79ea5e387456
Start running windows/aarch64 web-platform tests on mozilla-central again; r=jmaher
Status: REOPENED → RESOLVED
Closed: 5 years ago → 4 years ago
Resolution: --- → FIXED
Status: RESOLVED → REOPENED
Resolution: FIXED → ---

Just documenting in the bug that only wpt were re-enabled at this time due to capacity constraints. There's more discussion in D57051.

Joel, how are things looking now? Are the laptops running efficiently, or are we still losing workers? I'm wondering if we need to open a separate bug to increase pool size.

Is there any work remaining for this bug?

Thanks!

Flags: needinfo?(jmaher)

we don't have realistic options to increase the pool size, just to make sure what equipment we have is online as much as possible.

I see 15 online now and that seems to be stable, likewise our queues are not too bad, so I propose doing more work here to adjust scheduling:

  1. remove raptor-browsertime tests that are running tier-3 on win/aarch64 for autoland: speedometer, tp6-1-p
  2. run reftest, crashtest and mochitest-media on m-c pushes
  3. determine if we have perf needs for talos or raptor tests. I suspect that the talos/raptor data for win/aarch64 is very similar if not identical to win10, so having no aarch64 coverage is low risk. If we determine certain tests are useful to run, then we should schedule them.

:gbrown, can you take care of #1 and #2
:davehunt, can you take care of #3

Flags: needinfo?(jmaher)
Flags: needinfo?(gbrown)
Flags: needinfo?(dave.hunt)

(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #33)

  1. determine if we have perf needs for talos or raptor tests. I suspect that the talos/raptor data for win/aarch64 is very similar if not identical to win10, so having no aarch64 coverage is low risk. If we determine certain tests are useful to run, then we should schedule them.

The results for windows10-aarch64 are indeed very different from windows10-64, as can be seen from the graph at https://treeherder.mozilla.org/perf.html#/graphs?highlightAlerts=1&series=mozilla-central,2006452,1,10&series=mozilla-central,1969770,1,10&timerange=5184000. I wasn't aware that we hadn't had results for the last month. This makes it difficult to show the series in Perfherder, so I don't have a comparison for Talos. I'm not aware of needs changing around this platform, but I can find out if these tests are still needed.

Flags: needinfo?(dave.hunt)

that link is not comparing the same test, here is the link:
https://treeherder.mozilla.org/perf.html#/graphs?highlightAlerts=1&series=mozilla-central,2006452,1,10&series=mozilla-central,2006533,1,10&timerange=7776000

it is different, but improvements and regressions seem to be identical, the pattern is the same, but the baseline value and scale of change is different.

(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #35)

it is different, but improvements and regressions seem to be identical, the pattern is the same, but the baseline value and scale of change is different.

Thanks, I was struggling with Perfherder as I was unable to select data from the platform (potentially as there are no recent results). I filed bug 1604871 for this. I've checked with :vchin and :esmyth, and they both feel that we should continue to run performance tests on this platform while it remains tier 1 in https://developer.mozilla.org/en-US/docs/Mozilla/Supported_build_configurations

which performance tests should we run? talos, raptor, all? we don't have a lot of available machine time, so if we can reduce perf that will help. Our unittests have been significantly reduced already.

(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #37)

which performance tests should we run? talos, raptor, all? we don't have a lot of available machine time, so if we can reduce perf that will help. Our unittests have been significantly reduced already.

Has available machine time changed since these tests were first enabled, or did we underestimate how much was needed? I'm open to suggestions for what tests we should skip, although we were already only running these on mozilla-central. Can something like SETA help here at all? I would say Raptor, AWSY and Talos startup tests would all be candidates for inclusion. Within Raptor we can limit the pageload tests to cold, and potentially further limit to a subset but that might become difficult to maintain until we have a smart way to identify these.

I need to determine original capacity vs current capacity, it seems lower now.

SETA is currently only applicable to autoland, I think we should shoot for:

  • awsy
  • talos startup (ts_paint, sessionrestore*)
  • raptor (cold for tp6)

these will be m-c only and on try with --full.

:gbrown, could you look into adding the above perf to be scheduled as well?

(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #39)

  • raptor (cold for tp6)

Please include the benchmarks and youtube media playback (dropped frame) tests.

Stop running integration branch browsertime tasks on windows10-aarch64;
these were added recently in bugs 1585013 and bug 1604113 but I don't
think there was any specific consideration of windows10-aarch64.
Restore mochitest-media and all reftest tests on mozilla-central only
on windows10-aarch64.

Comment 41 is intended to address #1 and #2 of comment 33.

Flags: needinfo?(gbrown)
Pushed by gbrown@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/66954a0b1144
Adjust tasks run on windows aarch64; r=jmaher
Status: REOPENED → RESOLVED
Closed: 4 years ago → 4 years ago
Resolution: --- → FIXED

leave open to get more perf jobs turned on, ensure load is ok, and consider more unittests.

Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Regressions: 1605069

(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #45)

leave open to get more perf jobs turned on, ensure load is ok, and consider more unittests.

We have no perf results for windows10-aarch64 since November. What is the current load? Can we start re-enabling perf tests?

Flags: needinfo?(jmaher)

we intended to not run perf tests on win/aarch64- the goal here is to make sure we keep a build and some of the more risky stuff running green, but nothing else. In the event we find that our userbase increases, we will re-evaluate our testing strategy on win/aarch64 which will include the hardware, OS version, and what we run.

Flags: needinfo?(jmaher)

(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #47)

we intended to not run perf tests on win/aarch64-

That contradicts your earlier comment of "leave open to get more perf jobs turned on". As Windows on ARM64 is a tier 1 platform, could we enable some minimal perf tests? Those mentioned in comment 39 would be a good start.

that was 7 months ago (december 2019), we have evaluated the market place and our commitment to win/aarch64 and determined in order to save costs and engineering efforts in 2020 we will do as little as possible on win/aarch64.

After chatting to :jmaher in Matrix, we've agreed that we have capacity to run at least one performance test. With advice from :mconley we've selected startup_about_home_paint.

Not actively working on this right now.

Assignee: pmoore → nobody

Hey Dave, shall we resolve this WONTFIX or do we still want to pursue this? Thanks!

Flags: needinfo?(dave.hunt)

I feel we can close this as fixed as there were several changes and I believe the issue was resolved.

Status: REOPENED → RESOLVED
Closed: 4 years ago → 3 years ago
Flags: needinfo?(dave.hunt)
Resolution: --- → FIXED