Closed Bug 1018531 Opened 10 years ago Closed 10 years ago

Panda tests retrying more than necessary

Categories

(Release Engineering :: General, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: coop, Assigned: pmoore)

References

Details

(Whiteboard: [capacity][mobile][panda])

Attachments

(1 file, 1 obsolete file)

We started updating the tools checkout on the foopies during reconfigs as part of bug 1018248 today. This is absolutely the right thing to do and should help us not get out of sync there again, but it's had some unintended consequences this first time around.

Looking at the results for panda jobs after the tools checkout was updated on the foopies today, we see an immediate increase in the number of retries:

Before update:
https://tbpl.mozilla.org/?tree=Mozilla-Inbound&rev=ef6fc5965a4c

After update:
https://tbpl.mozilla.org/?tree=Mozilla-Inbound&rev=9b01df568861

Diving into the "after" tests for Android 4.0 mochist-3 opt (the first example where we have multiple failures followed by a successful test), the "failed" examples all contain a log line like this:

08:49:33     INFO -  Caught Exception: Remote Device Error: unable to connect to panda-0434 after 5 attempts

From: https://tbpl.mozilla.org/php/getParsedLog.php?id=40735272&tree=Mozilla-Inbound&full=1#error0

This error doesn't seem to be fatal at all. Indeed, the rest of the mochitest runs to completion afterwards but the error parser flags it as needing a retry:

https://hg.mozilla.org/build/buildbotcustom/file/b047f08ca8b4/status/errors.py#l9

In this particular case, we "failed" 4 times at 23 minutes a piece before finally succeeding. Every one of those tests actually succeeded. :(

I know we're surfacing more exceptions now in an effort to get them fixed. This seems like one of the first cases we need to fix.

We could change the error parser while we work on a proper fix. Testing in my python console, I think a regexp like this should work in status/errors.py#l9:

re.compile("(?<!INFO -  Caught Exception: )Remote Device Error")

Pete: assigning this to you because one way of another you'll need to deal with this as buildduty on Monday. You also understand the problem space better than most.
Tomcat|sheriffduty
01:42:25 PDT: see https://tbpl.mozilla.org/?tree=Mozilla-Inbound&rev=fd9518344a23
01:42:27 PDT: the blue ones
01:42:47 PDT: all different pandas with Caught Exception: Remote Device Error: unable to connect to panda-0627 after 5 attempts etc
01:44:12 PDT: and seems a general problem when i load https://tbpl.mozilla.org/?tree=Mozilla-Inbound
01:44:17 PDT: lots of blue :)
So the tools version deployed before was 774756f24bef. The current deployed version is 9fdad0dcf1bc. Just checking the delta for obvious changes that could affect.
Nothing obviously suspect at first glance. E.g for sut tools, the only change is:

$ git diff 2b990522b8fa9714ff0afb925fb725f190408c36..28e463a044aaea3f650ca8a07f10a969acb3ea1b -- sut_tools
diff --git a/sut_tools/mozdevice/devicemanagerSUT.py b/sut_tools/mozdevice/devicemanagerSUT.py
index 1733e49..00685b9 100644
--- a/sut_tools/mozdevice/devicemanagerSUT.py
+++ b/sut_tools/mozdevice/devicemanagerSUT.py
@@ -991,7 +991,8 @@ class DeviceManagerSUT(DeviceManager):
         else:
             try:
                 data = self._runCmds([{'cmd': 'testroot'}])
-            except:
+            except BaseException as e:
+                print "Caught Exception:", e
                 return None
 
             deviceRoot = data.strip() + '/tests'
$

And this should be harmless. In any case, I'll check the logs to see if we hit this "Caught Exception" line...
So I have a first suspicion!!

I *think* what might be the problem, is that previously, we were swallowing this exception, and not reporting on it. Now we still don't process it (we have the except clause) but we do mention that we caught this error in the logs. We didn't used to do that.

I *think* the system that decides this was a "blue build" and needs retrying is actually parsing our logs, and seeing the word "Exception" and therefore thinking that this build has failed, when in fact it has passed.

This is just a hypothesis at this point... Will investigate further.
Test output below. You see *with* the "Caught Exception:" no SRE_Match object is returned (i.e. the line is not matched) but without the "Caught Exception:" the SRE_Match object is returned, as required.


Python 2.7.6rc1 (v2.7.6rc1:4913d0e9be30+, Oct 27 2013, 20:52:11) 
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> print re.search("(?<!INFO -  Caught Exception: )Remote Device Error", "09:26:32     INFO -  Caught Exception: Remote Device Error: unable to connect to panda-0361 after 5 attempts")
None
>>> print re.search("(?<!INFO -  Caught Exception: )Remote Device Error", "09:26:32     INFO -  Remote Device Error: unable to connect to panda-0361 after 5 attempts")
<_sre.SRE_Match object at 0x10047db28>
>>> quit()
Attachment #8432414 - Flags: review?(sbruno)
Actually, let me add a comment too - to explain this strange line :D
Attachment #8432414 - Attachment is obsolete: true
Attachment #8432414 - Flags: review?(sbruno)
Attachment #8432418 - Flags: review?(sbruno)
Attachment #8432418 - Flags: review?(sbruno) → review+
Blocks: 1019038
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
In production
Component: Tools → General
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: