Closed Bug 1018118 Opened 10 years ago Closed 10 years ago

Growing pending queues for tegras and time between jobs per tegra > 6 hours

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task, P2)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: pmoore, Assigned: coop)

References

Details

Attachments

(2 files)

All the tegras I've looked at show the same pattern: they grab a single
job, run it to completion, and then hang for 6 hours until the slaveapi
reboot successfully resurrects them.

Here's an example tegra. Notice the ~6hr span between jobs:

https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?class=test&type=tegra&name=tegra-101

If I look at the log for the most recent job as of now, it's hard (for
me) to tell if the reboot is working as intended:

http://buildbot-master88.srv.releng.scl3.mozilla.com:8201/builders/Android%202.2%20Tegra%20b2g-inbound%20opt%20test%20plain-reftest-4/builds/627/steps/reboot%20device/logs/stdio

A 6hr cycle time is obviously untenable. Has anything changed in this
code recently? I took a look at the tools repo but didn't see anything.
Maybe something has changed on the networking side recently?
1) lots of useful information in the watcher.log, error.flg, even twisted.log for this tegra - which should really be available from slavehealth page (bug 976100 would fix this)

2) I could connect to the sit agent, but the test root was not available - however, error.log said it could not connect to the device, which is not true (bug 976106 would fix this)

3) test root is not available if sd card cannot be mounted, but I found a race condition that can cause this problem (bug 1010173 would fix the race condition) - so I wanted to see if this was the cause… i could test this by seeing if i can write to the sd card, even though the device says the test root is unavailable… so i tried this, and had a strange surprise!

4) when i connect to it, and ls /mnt/sdcard, i can see it is an empty directory, i double checked the test root with the testroot command and confirmed it returned the known error “##AGENT-WARNING##  unable to determine test root”. Then I tried to copy a (small) file into /mnt/sdcard to see if I could write there - and this is what happened… The copy command (cp) crashed the SUT Agent (baaaaaad bug). It didn’t just cause the client to disconnect, it actually crashed the SUT agent daemon on port 20701. But no worries - there is also a daemon on port 20700. I tried the same on this port too (notice, on port 20700 there is no prompt, so slightly harder to differentiate input/output below). This did exactly the same thing - so now I crashed but daemons for this device, and rendered it useless, until it is PDU rebooted.





[cltbld@foopy116.tegra.releng.scl3.mozilla.com tegra-101]$ telnet tegra-101 20701
Trying 10.26.85.77...
Connected to tegra-101.
Escape character is '^]'.
$>testroot
##AGENT-WARNING##  unable to determine test root
$>ls /mnt/sdcard
$>ls
dev
root
data
default.prop
init
init.goldfish.rc
init.rc
init_recovery.rc
proc
sbin
sys
system
etc
d
mnt
acct
sdcard
cache
misc
config
usbdisk
$>cat default.prop
#
# ADDITIONAL_DEFAULT_PROPERTIES
#
ro.secure=0
ro.allow.mock.location=1
ro.debuggable=1
persist.service.adb.enable=1
persist.tegra.dpy5.mode.width=1280
persist.tegra.dpy5.mode.height=720
EXTERNAL_STORAGE_MOUNT=/mnt/sdcard
ro.opengles.version=131072
dalvik.vm.heapsize=48m
$>cp default.prop /mnt/sdcard
Connection closed by foreign host.
[cltbld@foopy116.tegra.releng.scl3.mozilla.com tegra-101]$ telnet tegra-101 20701
Trying 10.26.85.77...
telnet: connect to address 10.26.85.77: Connection refused
[cltbld@foopy116.tegra.releng.scl3.mozilla.com tegra-101]$ telnet tegra-101 20701
Trying 10.26.85.77...
telnet: connect to address 10.26.85.77: Connection refused
[cltbld@foopy116.tegra.releng.scl3.mozilla.com tegra-101]$ telnet tegra-101 20701
Trying 10.26.85.77...
telnet: connect to address 10.26.85.77: Connection refused
[cltbld@foopy116.tegra.releng.scl3.mozilla.com tegra-101]$ telnet tegra-101 20701
Trying 10.26.85.77...
telnet: connect to address 10.26.85.77: Connection refused
[cltbld@foopy116.tegra.releng.scl3.mozilla.com tegra-101]$ telnet tegra-101 20700
Trying 10.26.85.77...
Connected to tegra-101.
Escape character is '^]'.
20140530-00:52:29 trace output
ls /mnt/sdcard

ls
dev
root
data
default.prop
init
init.goldfish.rc
init.rc
init_recovery.rc
proc
sbin
sys
system
etc
d
mnt
acct
sdcard
cache
misc
config
usbdisk
20140530-00:53:29 Thump thump - 00:26:e8:d4:25:33
cp default.prop /mnt/sdcard/default.prop
Connection closed by foreign host.
[cltbld@foopy116.tegra.releng.scl3.mozilla.com tegra-101]$ telnet tegra-101 20700
Trying 10.26.85.77...
telnet: connect to address 10.26.85.77: Connection refused
[cltbld@foopy116.tegra.releng.scl3.mozilla.com tegra-101]$ telnet tegra-101 20700
Trying 10.26.85.77...
telnet: connect to address 10.26.85.77: Connection refused
[cltbld@foopy116.tegra.releng.scl3.mozilla.com tegra-101]$ telnet tegra-101 20701
Trying 10.26.85.77...
telnet: connect to address 10.26.85.77: Connection refused
[cltbld@foopy116.tegra.releng.scl3.mozilla.com tegra-101]$ telnet tegra-101 20700
Trying 10.26.85.77...
telnet: connect to address 10.26.85.77: Connection refused
[cltbld@foopy116.tegra.releng.scl3.mozilla.com tegra-101]$
(In reply to Pete Moore [:pete][:pmoore] from comment #0)
> All the tegras I've looked at show the same pattern: they grab a single
> job, run it to completion, and then hang for 6 hours until the slaveapi
> reboot successfully resurrects them.
> 
> Here's an example tegra. Notice the ~6hr span between jobs:
> 
> https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.
> html?class=test&type=tegra&name=tegra-101
> 
> If I look at the log for the most recent job as of now, it's hard (for
> me) to tell if the reboot is working as intended:
> 
> http://buildbot-master88.srv.releng.scl3.mozilla.com:8201/builders/
> Android%202.2%20Tegra%20b2g-inbound%20opt%20test%20plain-reftest-4/builds/
> 627/steps/reboot%20device/logs/stdio
> 
> A 6hr cycle time is obviously untenable. Has anything changed in this
> code recently? I took a look at the tools repo but didn't see anything.
> Maybe something has changed on the networking side recently?

I forgot to say this was copied/pasted from an email from :coop ...
Current error.flg files I see:

75 tegras: Automation Error: Unable to connect to device after 5 attempts
49 tegras: Unknown verify failure
8 tegras: Remote Device Error: Unhandled exception in cleanupDevice
7 tegras: Remote Device Error: updateApp() call failed - exiting
5 tegras: Remote Device Error: Unable to properly remove /mnt/sdcard/tests
3 tegras: Automation Error: Unable to ping device after 5 attempts
1 tegra: Remote Device Error: unable to write to sdcard
1 tegra: Remote Device Error: failed to restore /system/etc/hosts
Of the 75 "Automation Error: Unable to connect to device after 5 attempts" I believe many may be because of bug 1010173. Testing this theory now by attempting to write to the test root, even though SUT Agent says it is unavailable.

I don't yet have an explanation for the long periods of time between taking jobs/rebooting successfully.
Attached file watcher.log-20140530
this watcher.log is quite revealing...

29/05/2014 20:27:28 - 29/05/2014 21:18:01 PDT
   ran: http://buildbot-master88.srv.releng.scl3.mozilla.com:8201/builders/Android%202.2%20Armv6%20Tegra%20try%20opt%20test%20robocop-2/builds/3541
   nasty syslog: http://buildbot-master88.srv.releng.scl3.mozilla.com:8201/builders/Android%202.2%20Armv6%20Tegra%20try%20opt%20test%20robocop-2/builds/3541/steps/reboot%20device/logs/stdio/text
   green

05/29/2014 21:20:01: DEBUG: Warning: bad syntax, perhaps a bogus '-'? See /usr/share/doc/procps-3.2.8/FAQ
2014-05-29 21:30:14-0700 [Broker,client] I have a leftover directory 'talos-data' that is not being used by the buildmaster: you can delete it now

It has just finished cleaning up, and making itself available.

One second later, it is given a job from the buildbot master:

29/05/2014 21:30:15	- 29/05/2014 21:51:35 PDT
   ran: http://buildbot-master88.srv.releng.scl3.mozilla.com:8201/builders/Android%202.2%20Armv6%20Tegra%20try%20opt%20test%20crashtest/builds/2875

The logs now show it is running successfully again:

   3821 2014-05-29 21:35:01 -- ###############################################################
   3822 2014-05-29 21:35:01 -- # Starting cycle for our device (tegra-101 = 10.26.85.77) now #
   3823 2014-05-29 21:35:01 -- ###############################################################
   3824 2014-05-29 21:35:01 -- contacting slavealloc
   3825 2014-05-29 21:35:01 -- buildbot pid is 18217
   3826 2014-05-29 21:35:01 -- (heartbeat) buildbot is running
   3827 2014-05-29 21:35:01 -- Cycle for our device (tegra-101) complete


   3849 2014-05-29 21:55:01 -- ###############################################################
   3850 2014-05-29 21:55:01 -- # Starting cycle for our device (tegra-101 = 10.26.85.77) now #
   3851 2014-05-29 21:55:01 -- ###############################################################
   3852 2014-05-29 21:55:01 -- contacting slavealloc
   3853 2014-05-29 21:55:01 -- Buildbot is not running

The job completes (successfully) at 21:51:35 PDT.

A few minutes later:

 401 05/29/2014 21:55:08: INFO: Uninstalling org.mozilla.fennec...
 402 05/29/2014 21:55:08: INFO: Waiting for device to come back...
 403 05/29/2014 21:57:08: INFO: Try 1
 404 05/29/2014 21:57:58: INFO: devroot None
 405 05/29/2014 21:59:58: INFO: Try 2
 406 2014-05-29 22:00:01 -- *** ERROR *** failed to aquire lockfile
 407 05/29/2014 22:00:49: INFO: devroot None
 408 05/29/2014 22:02:49: INFO: Try 3
 409 05/29/2014 22:03:39: INFO: devroot None
 410 2014-05-29 22:05:01 -- *** ERROR *** failed to aquire lockfile
 411 05/29/2014 22:05:39: ERROR: Remote Device Error: waiting for device timed out.
 412 05/29/2014 22:05:39: ERROR: Remote Device Error: waiting for device timed out.
 413 05/29/2014 22:06:29: INFO: /builds/tegra-101/error.flg
 414 05/29/2014 22:06:59: INFO: verifyDevice: failing to cleanup device

So there is a cleanup job running, which fails after the reboot.

I am quite sure this again could be the race condition from bug 1010173.

However, only a reboot can fix this.

For the next several hours, no reboot is attempted. Instead, every 5 minutes the watcher runs, sees there is an error.flg and then exits, apart from once per hour when it deletes the error.flg and retries. Each time, it will get the device root error “##AGENT-WARNING##  unable to determine test root” and so it will recreate the error.flg file, and give up again.

I think eventually, there is some generic slave rebooter, that sees - "wow - 6 hours since this machine last took a job, I'm going to reboot it". And then, hopefully all is ok again, although of course the race condition is always a possibility.

Conclusion
==========
* Necessary: This race condition needs to be fixed (bug 1010173)
* Necessary: Logging needs to be improved (don't just say error.flg exists, output what it contains)
* Necessary: Errors should not be swallowed, e.g. bug 976106 needs to be completed
* Optional:  Slave health should provide links to log files, to make life easier to troubleshoot (bug 976100)
* Necessary: We need to find out what the numerous error messages are caused by in sut tools, e.g.:
    1) http://buildbot-master88.srv.releng.scl3.mozilla.com:8201/builders/Android%202.2%20Armv6%20Tegra%20try%20opt%20test%20robocop-2/builds/3541/steps/reboot%20device/logs/stdio/text
   green
    2) 05/29/2014 21:20:01: DEBUG: Warning: bad syntax, perhaps a bogus '-'? See /usr/share/doc/procps-3.2.8/FAQ
    3) Maybe ok, maybe not: 2014-05-29 21:30:14-0700 [Broker,client] I have a leftover directory 'talos-data' that is not being used by the buildmaster: you can delete it now
* Optional: until bug 1010173 is landed, consider allowing watcher to reboot device in case of persistent failure
Whoever gets to this first, please clear the review request for the other reviewer unless you think it needs more eyes.

This patch does 3 things:
1) Adds the contents of the error.flg file where pertinent in watch_devices.sh
2) Switches sut_lib to use the Linux syntax for ps. The Mac syntax was causing the "Warning: bad syntax, perhaps a bogus '-'" output, and foopies are not Macs any more.
3) Attempts a soft reboot if we cannot ping the device
Attachment #8431652 - Flags: review?(pmoore)
Attachment #8431652 - Flags: review?(bugspam.Callek)
Comment on attachment 8431652 [details] [diff] [review]
Reboot unreachable tegras

Review of attachment 8431652 [details] [diff] [review]:
-----------------------------------------------------------------

r+ with the changes I suggested :)

::: buildfarm/mobile/watch_devices.sh
@@ +58,5 @@
>        death "Not Starting, slavealloc says we're disabled" 64
>      fi
>      if [ -f /builds/$device/error.flg ]; then
>        log "error.flg file detected"
> +      local contents=`cat error.flg`

This should be:
local contents=`cat /builds/$device/error.flg`

@@ +86,5 @@
>    else # buildbot running
>      log "(heartbeat) buildbot is running"
>      if [ -f /builds/$device/error.flg ]; then
>        log "Found an error.flg, expecting buildbot to self-kill after this job"
> +      local contents=`cat error.flg`

This should be:
local contents=`cat /builds/$device/error.flg`

::: lib/python/sut_lib/__init__.py
@@ +197,5 @@
>      # 18456     1 18455 /opt/local/Libra   ??  S      0:00.88 /opt/local/Library/Frameworks/Python.framework/Versions/2.6/Resources/Python.app/Contents/MacOS/Python /opt/local/Library/Frameworks/Python.framework/Versions/2.6/bin/twistd$
>      # 18575 18456 18455 /opt/local/Libra   ??  S      0:00.52
>      # /opt/local/Library/Frameworks/Python.framework/Versions/2.6/Resources/Python.app/Contents/MacOS/Python
>      # ../../sut_tools/installApp.py 10.250.49.8 ../talos-data/fennec-4.1a1pr$
> +    p, lines = runCommand(['ps', 'xU', 'cltbld', '-O', 'ppid,pgid,command'])

This looks good - did you test it?

::: sut_tools/verify.py
@@ +367,3 @@
>          log.info("verifyDevice: failing to ping")
> +        # See if we can recover the device with a reboot.
> +        soft_reboot(dm=dm, device=device)

I'll defer to Callek on this one :)
Attachment #8431652 - Flags: review?(pmoore) → review+
Callek, are you happy with the soft_reboot part?
Flags: needinfo?(bugspam.Callek)
Assignee: nobody → coop
Status: NEW → ASSIGNED
Priority: -- → P2
Callek: ping re: comment #8?
Not ideal imo. But yea let's do it
Flags: needinfo?(bugspam.Callek)
Comment on attachment 8431652 [details] [diff] [review]
Reboot unreachable tegras

Review of attachment 8431652 [details] [diff] [review]:
-----------------------------------------------------------------

https://hg.mozilla.org/build/tools/rev/1aee4963a541
Attachment #8431652 - Flags: review?(bugspam.Callek) → checked-in+
This got deployed yesterday.
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
See Also: → 1028772
There are just a couple of trailing parts left open, so reopening to track these...

* Investigate: 2014-05-29 21:30:14-0700 [Broker,client] I have a leftover directory 'talos-data' that is not being used by the buildmaster: you can delete it now
* Until bug 1010173 is landed, consider allowing watcher to reboot device in case of persistent failure
Status: RESOLVED → REOPENED
Depends on: 1010173, 976106, 976100
Resolution: FIXED → ---
Ah - the second one is handled by coop's patch, i believe - so just the first one - but i think i can handle that with adding a bug dependency (we already have a bug about cleanup not being done - i'll find it...)
Depends on: 742479
Summary: Pending queue for tegras > 1000 and time between jobs per tegra is > 6 hours → Growing pending queues for tegras and time between jobs per tegra > 6 hours
* Investigate: 2014-05-29 21:30:14-0700 [Broker,client] I have a leftover directory 'talos-data' that is not being used by the buildmaster: you can delete it now

This has been a long standing thing, and is a directory on the foopy. its a product of how tests run for tegras, and is just informational from buildbot when it connects with/coordinates with a master.

I call this bug fixed.
Status: REOPENED → RESOLVED
Closed: 10 years ago10 years ago
Resolution: --- → FIXED
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: