Closed Bug 810439 Opened 12 years ago Closed 12 years ago

mozharness support for mozpool

Categories

(Release Engineering :: Applications: MozharnessCore, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: mozilla, Assigned: jhopkins)

References

Details

(Whiteboard: [mozharness][mozpool])

Attachments

(4 files, 1 obsolete file)

http://hg.mozilla.org/build/bmm/file/default/API.txt
http://hg.mozilla.org/build/bmm/file/default/README.md

Mark says Dustin may have a staging server set up shortly, which would help with developing and verifying that this actually works.
You should be able to do initial development against a local copy.  The README.md pretty much explains how to do that, but Mark or I can help, too.
Attached patch: small changes (Splinter Review)
s,bmm-inventorysync,mozpool-inventorysync,
add argparse as a dependency, since mozpool-inventorysync requires it.
Attachment #680914 - Flags: review?(dustin)
Comment on attachment 680914 [details] [diff] [review]
small changes

Thanks!
Attachment #680914 - Flags: review?(dustin) → review+
Looking good!

Is mozpool-server adequate for your testing?  Currently all of the hardware in scl1 is hooked up to the same pool of devices, some of which are in production, so it could be a little tricky to get you a real-life server to play with.

When we get the power situation sorted out (just came up last night), I could probably isolate a rack for you as a different pool, where you would not worry about affecting production pandas.  Let me know if/when that becomes desirable.
(In reply to Dustin J. Mitchell [:dustin] from comment #5)
> Looking good!
> 
> Is mozpool-server adequate for your testing?  Currently all of the hardware
> in scl1 is hooked up to the same pool of devices, some of which are in
> production, so it could be a little tricky to get you a real-life server to
> play with.

Thanks!
I'm making do so far, but I've only written/tested the read-only portions of the API.  I'm guessing the mozpool requests are harmless, but I'd like to double-check.  I'm certain that at least some of the lifeguard/BMM API isn't safe to test against my laptop + production pandas.

> When we get the power situation sorted out (just came up last night), I
> could probably isolate a rack for you as a different pool, where you would
> not worry about affecting production pandas.  Let me know if/when that
> becomes desirable.

That would be cool if/when it's doable.  I'm not sure I need an entire rack, but if that's the easiest unit to isolate, I can expand the testing matrix with the extra pandas.  I imagine we'll need a staging pool when we're trying to get the tests up and running anyway.
Incidentally, see
  http://hg.mozilla.org/build/mozpool/file/66b87baf681b/README.md#l108
specifically the part about hitting any mobile-imaging server.  In a foopyish world, it'd probably be best for each foopy to hit its "local" imaging server first, and if that server fails then select another at random.  In a foopyless world, it'd be best to just hit them randomly.  In either case, the idea is to make sure there's an even distribution of requests across imaging servers, to limit the damage if one imaging server goes down.

We can easily configure Puppet to keep the list of imaging servers up to date.
...and the other part of the idea is that if a particular imaging server is down, work goes on without it: foopies or buildslaves contact the other imaging servers instead of failing.
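
For illustration, a minimal sketch of that selection logic (the server list, port, and health check are assumptions; in practice Puppet would supply the list):

import random
import socket

def is_reachable(server, port=80, timeout=5):
    # crude health check: can we open a TCP connection to the imaging server?
    try:
        socket.create_connection((server, port), timeout).close()
        return True
    except (socket.error, socket.timeout):
        return False

def pick_imaging_server(servers, local_server=None):
    """Prefer the "local" imaging server if there is one (foopy-ish world);
    otherwise try the rest in random order so load spreads evenly."""
    candidates = list(servers)
    random.shuffle(candidates)
    if local_server in candidates:
        candidates.remove(local_server)
        candidates.insert(0, local_server)
    for server in candidates:
        if is_reachable(server):
            return server
    raise RuntimeError("no mobile-imaging server reachable")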
The pandas in chassis 3 can/should be used as the staging server for this if needed.
Well, given I'm still using those, let me know first so I don't mess things up!
I have this about a quarter written.  John Hopkins may be taking over.
Blocks: 802317
The requests part has been fully implemented now, by the way. There's a slight change to the API, but not much. I have explained in API.txt how requests work (high-level summary: send a request with details about the reservation and desired image, then poll the request object until its state changes to 'ready' or an error state).

You can test this locally, although you'll have to cycle through the imaging/power-cycling tests yourself (not difficult). I am adding a section to the MozPool wiki (https://wiki.mozilla.org/index.php?title=Auto-tools/Projects/MozPool) about testing locally. By the way, the wiki page is a bit out of date now; prefer the README.txt in the source for a better, more comprehensive description. I'll update the wiki page when I have some time.
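
A rough sketch of the polling half of that flow, assuming the request URL returned by the device-request call and the 'state' field described in API.txt (the error-state check is an assumption):

import time
import requests

def wait_for_request_ready(request_url, interval=30, max_polls=20):
    """Poll the mozpool request object until it becomes ready or we give up."""
    for _ in range(max_polls):
        state = requests.get(request_url).json()['state']
        if state == 'ready':
            return True
        if 'failed' in state or state == 'closed':
            # assumption: terminal/error states are reported via 'state'
            return False
        time.sleep(interval)
    return False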
In patch format for posterity.
Moved the testing section to its own page since it got long: https://wiki.mozilla.org/Auto-tools/Projects/MozPool/LocalTesting

It's still linked from the main mozpool page.
Very slight API change that doesn't look like it affects what you've written so far: rather than returning both a "request" object and "request_url" string from a device-request call, there's now just a "request" object that has an additional "url" attribute. API.txt has been updated accordingly.
Assignee: aki → jhopkins
I'm going to dupe bug 778249 to bug 777714 rather than having two releng tracking bugs.
No longer blocks: 778249
Another small modification to the API that I think makes more sense: /request/list/ now, by default, doesn't include closed requests.  /request/list/?include_closed=1 will return all.  API.txt has been updated accordingly.
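
For illustration (the host is a placeholder):

import requests

base = 'http://mobile-imaging-001.example.com'  # placeholder host
open_requests = requests.get(base + '/request/list/').json()
all_requests = requests.get(base + '/request/list/', params={'include_closed': 1}).json()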

There might be other small modifications as I build out the UI.  I will document them all here.  I expect to have the first version done by the end of week, after which I will endeavour to make everything backwards compatible.
A few small changes landing soon. The request "expires" attribute (from request/list/ or from device/*/request/) is now in UTC (to avoid timezone madness!).  request/list/ now has "device_state" (if a device has been assigned) and no longer includes "state_counters" and "state_timeout" (internal usage).
Should the milestone of this bug be "B2G C2 (20nov-10dec)"?
(In reply to Armen Zambrano G. [:armenzg] from comment #20)
> Should the milestone of this bug be "B2G C2 (20nov-10dec)"?

Yes.
I emailed Alex a list of bugs to nominate on Monday(?), but he's been busy with the merge plus a mass of emails, etc.  We should proceed as if this were already marked as due Dec 10.
blocking-basecamp: --- → ?
Hopefully this is the last API change. /device/.../request/ now takes a JSON object as POST data with these keys (for b2g imaging, the only supported request at the moment):

'assignee': as before (free form, suggested to be email address or slave hostname)
'duration': as before, length of request in seconds
'image': must be 'b2g'
'b2gbase': URL to panda b2g build

So in other words, the changes are that 'image' is now required and, for the moment, must be 'b2g'; and instead of a 'boot_config' object with 'version' and 'b2gbase' keys, there is now just a 'b2gbase' key in the main object (no 'boot_config' anymore).

This will ease future features (for Fennec automation, autophone, etc.) while hiding implementation details (PXE configs, low-level images, etc.) from test clients.

This should be deployed to production this afternoon (EST).
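
A client-side sketch of the new call (host, device name, and URLs are placeholders; the keys are the four listed above):

import json
import requests

device = 'panda-0001'  # placeholder device name
payload = {
    'assignee': 'talos-r4-b2g-001.example.com',  # free form: email address or slave hostname
    'duration': 60 * 20,                         # length of the request in seconds
    'image': 'b2g',                              # required; only 'b2g' is supported for now
    'b2gbase': 'http://example.com/b2g/latest/', # URL to the panda b2g build
}
url = 'http://mobile-imaging-001.example.com/device/%s/request/' % device
response = requests.post(url, data=json.dumps(payload)).json()
request_url = response['request']['url']  # poll this until the request is 'ready'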
blocking-basecamp: ? → ---
Attachment #687265 - Flags: feedback?(dustin)
Landed http://hg.mozilla.org/build/mozharness/rev/3a7a1decf93c to silence spurious pyflakes errors caused by the requests module.
Comment on attachment 687265 [details] [diff] [review]
[mozharness] client-side api + unit tests, initial b2g test + config

Cool, thanks for this.
I'm writing the comments first, and can try to help fix some of these issues afterwards.

I'm r-'ing because there's a lot here to fix, and some of it will affect the rest of mozharness negatively.

However, I do want to see this fixed up and landed soon, and I'm willing to help.  I started an "aki" branch off the tip of mozpool: http://hg.mozilla.org/users/jhopkins_mozilla.com/mozharness-mozpool

Hopefully that will help?

===

Pyflakes says:

scripts/b2g_panda.py:15: 'PythonErrorList' imported but unused
scripts/b2g_panda.py:16: 'TBPL_WARNING' imported but unused
scripts/b2g_panda.py:16: 'TBPL_RETRY' imported but unused
scripts/b2g_panda.py:16: 'TBPL_SUCCESS' imported but unused
scripts/b2g_panda.py:16: 'TBPL_FAILURE' imported but unused
scripts/b2g_panda.py:20: 'INFO' imported but unused
scripts/b2g_panda.py:20: 'OutputParser' imported but unused
scripts/b2g_panda.py:74: local variable 'c' is assigned to but never used

Pet peeve nit: could you remove trailing whitespace?  You've got quite a bit.
:%s, *$,,
will fix it in vim.

I have the following in my .vimrc:

" Use the below highlight group when displaying bad whitespace is desired
highlight BadWhitespace ctermbg=red guibg=red
" Display tabs at the beginning of a line in Python mode as bad.
au BufRead,BufNewFile *.py,*.pyw,*.cfg match BadWhitespace /^\t\+/
" Make trailing whitespace be flagged as bad.
au BufRead,BufNewFile *.py,*.pyw,*.cfg,*.c,*.h match BadWhitespace /\s\+$/

I think jlund had something that automatically removed trailing whitespace whenever he edited a file; I haven't gone that far.

>+config = {
>+    # Values for the foopies
>+    "exes": {
>+        'python': '/tools/buildbot/bin/python',
>+        'virtualenv': ['/tools/buildbot/bin/python', '/tools/misc-python/virtualenv.py'],
>+    },
>+    "virtualenv_path": "venv",

We may get this for free if you add virtualenv_config_options to self.config_options.

>-        self.debug("Temporary files: %s and %s" % (tmp_stdout_filename, tmp_stderr_filename))
>+        #XXX: changed from self.debug to self.log due to this error:
>+        #     TypeError: debug() takes exactly 1 argument (2 given)
>+        self.log("Temporary files: %s and %s" % (tmp_stdout_filename, tmp_stderr_filename))

I see you lost my pastebin; I'll explain why you hit that error below.
Sadly, it would have been a lot less work to fix this when I gave you the pastebin, rather than after you added a bunch of unittests using the wrong structure.

Ideally, we don't land this debug->log change at all.
If the time crunch makes this the sanest option, change this to
    self.log("Temporary files: %s and %s" % (tmp_stdout_filename, tmp_stderr_filename), level=DEBUG)
and we'll need a followup bug to revert this once the unittests are fixed.

>+def check_mozpool_status(status):
>+    if not mozpool_status_ok(status):
>+        if status == 409:
>+            raise MozpoolConflictException()
>+        import pprint
>+        raise Exception('mozpool status not ok, code %s' % pprint.pformat(status))

Raising an uncaught Exception here seems rather fragile.  Is this mainly for debugging purposes, or are we planning to keep it this way?

We could change this to an UnknownMozpoolException or something if we want to catch it and throw an error.

If we actually want to exit the process at this point, I prefer self.fatal() over raising an exception, unless we explicitly want to be able to catch that exception downstream.  We're catching MozpoolConflictException; we're not catching the Exception.

self.fatal() allows us to log the message appropriately; you could take advantage of it if you moved check_mozpool_status() into MozpoolHandler and started calling self.check_mozpool_status().
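
Roughly, as a method on MozpoolHandler (mozpool_status_ok and MozpoolConflictException come from the patch; the wording of the fatal message is just a sketch):

def check_mozpool_status(self, status):
    if mozpool_status_ok(status):
        return
    if status == 409:
        # still raised, so callers can catch it and retry when the device is busy
        raise MozpoolConflictException()
    # unexpected status: log it properly and exit instead of raising a bare Exception
    self.fatal("Unexpected mozpool status code: %s" % status)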

>diff --git a/mpunit.sh b/mpunit.sh

At some point in the future, I'd like this to be testable standalone, without being networked and/or having mozpool-server running locally.  We'll most likely need a mock response to do that.  But for this first pass, this was the perfect approach.
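
One possible shape for that, purely illustrative (the patch target and canned payload are assumptions):

import requests
from mock import patch  # the standalone 'mock' package on Python 2

with patch('requests.get') as mock_get:
    mock_get.return_value.status_code = 200
    mock_get.return_value.json.return_value = {'state': 'ready'}
    # anything under test that calls requests.get() now sees the canned response,
    # so the unittests can run without a network or a local mozpool-server
    assert requests.get('http://mobile-imaging/request/1/').json()['state'] == 'ready'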

>+#TODO - adjust these values
>+MAX_RETRIES = 20
>+RETRY_INTERVAL = 30
>+REQUEST_DURATION = 60 * 20

At some point we may want this in a config file, or overrideable via config/commandline, but this is fine for now.

>+
>+class PandaTest(TestingMixin, BaseScript, VirtualenvMixin, MozpoolMixin, LogMixin):

You don't need LogMixin (BaseScript already inherits it).
You will need BuildbotMixin for self.read_buildbot_config(), self.set_buildbot_property(), and self.buildbot_status().

>+    virtualenv_modules = [
>+        'requests',
>+    ]

We may need mozdevice to use devicemanager.py for SUT calls, but this is fine for now.

>+        assert self.mozpool_device is not None, '--mozpool-device is required'
>+        self.mozpool_assignee = self.config.get('mozpool_assignee')
>+        if self.mozpool_assignee is None:
>+            self.mozpool_assignee = self.buildbot_config.get('properties')["slavename"]
>+        assert self.mozpool_assignee is not None, '--mozpool-assignee is required'

Could you change these asserts to self.fatal(message)?

>+    def query_abs_dirs(self):
>+        if self.abs_dirs:
>+            return self.abs_dirs
>+        abs_dirs = super(PandaTest, self).query_abs_dirs()
>+        dirs = {}
>+        for key in dirs.keys():
>+            if key not in abs_dirs:
>+                abs_dirs[key] = dirs[key]
>+        self.abs_dirs = abs_dirs
>+        return self.abs_dirs

We don't need this method redefined here at all if we're not going to define any directories, but I'm fine leaving it til we know for sure.

>+    def _build_arg(self, option, value):
>+        """
>+        Build a command line argument
>+        """
>+        if not value:
>+            return []
>+        return [str(option), str(value)]

We don't seem to be using this at all, but maybe we will when we populate run_tests().

>+    def _force_device_state(self, device, assignee, duration, desired_state):
>+        mph = self.query_mozpool_handler()
>+        dd = mph.query_device_status(device)
>+        old_state = dd['state']
>+        mph.device_state_change(device, assignee, duration, old_state, desired_state)

Same here.

>+    def retry_sleep(self, sleep_time=1, max_retries=5, error_message=None):
>+        for x in range(1, max_retries + 1):
>+            yield x
>+            sleep(sleep_time)
>+        if error_message:
>+            self.log(error_message, ERROR)

I think you either mean self.log(error_message, level=ERROR) or self.error(error_message).

>+        raise Exception('Retries limit exceeded')

This should be a fatal().
We may want to do something like self.buildbot_status(TBPL_RETRY) as well if that's the behavior we want from buildbot.
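
Putting those suggestions together, the generator might end up roughly like this (sleep comes from time in the script; the tbpl_status handling is a guess at the desired buildbot behavior):

def retry_sleep(self, sleep_time=1, max_retries=5, error_message=None, tbpl_status=None):
    for x in range(1, max_retries + 1):
        yield x
        sleep(sleep_time)
    if error_message:
        self.error(error_message)
    if tbpl_status:
        self.buildbot_status(tbpl_status)
    self.fatal('Retries limit exceeded')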

>+            except MozpoolConflictException:
>+                self.log("Device unavailable.  Retry#%i.." % retry, WARNING)

I think you mean self.warning(...) or level=WARNING

>+
>+        self.request_url = response['request']['url']
>+        self.log("Got request, url=%s" % self.request_url)

I'm going to guess this needs to be self.info()

>+            if state == 'ready':
>+                return
>+            self.log("Waiting for request 'ready' stage.  Current state: '%s'" % state)

self.info()

>+    def close_request(self):
>+        mph = self.query_mozpool_handler()
>+        mph.close_request(self.request_url)
>+        print("Request '%s' deleted on cleanup" % self.request_url)
>+        self.request_url = None

>+# MozpoolTest {{{1
>+class MozpoolTest(VirtualenvMixin, MozpoolMixin, BaseScript):

Do you need this anymore?
If not we can nuke it. (scripts/mozpooltest.py)

>+class BaseMozpoolTest(unittest.TestCase, VirtualenvMixin, MozpoolMixin, BaseScript):

The reason you're having problems with debug() is that unittest.TestCase has its own debug() method, which conflicts.  You need to change this class to inherit only from unittest.TestCase.

You can then set self.mozpool_script to a BaseScript object using MozpoolMixin.

To make the move less painful, you can then

    self.mph = self.mozpool_script.mph

This would have been easiest to do early on, rather than after all of these had been written to use the wrong structure.  I can try to help here.

There are also a lot of unused imports etc here.
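
Roughly what that layout looks like (imports and constructor arguments elided; setUp details are an assumption):

import unittest

class MozpoolScript(VirtualenvMixin, MozpoolMixin, BaseScript):
    pass

class BaseMozpoolTest(unittest.TestCase):
    def setUp(self):
        # build the mozharness script object the same way the old test classes did
        self.mozpool_script = MozpoolScript()
        # keeps the existing test bodies working with minimal edits:
        self.mph = self.mozpool_script.mph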
Attachment #687265 - Flags: review?(aki) → review-
http://hg.mozilla.org/users/jhopkins_mozilla.com/mozharness-mozpool/rev/8907c9fc1da3 may have fixed some of your unittest issues.

However, I'm not passing my unittests locally; not sure if that's because my db is set up differently than yours, or if I just broke something.
Attachment #687265 - Attachment is obsolete: true
Attachment #687265 - Flags: feedback?(dustin)
Attachment #687985 - Flags: review?(aki)
Comment on attachment 687985 [details] [diff] [review]
[mozharness] updated patch

>+    def wait_for_request_ready(self):
>+        mph = self.query_mozpool_handler()
>+        for retry in self.retry_sleep(sleep_time=RETRY_INTERVAL, max_retries=MAX_RETRIES,
>+                error_message="INFRA-ERROR: Request did not become ready in time",
>+                tbpl_status=TBPL_RETRY):
>+            response = mph.query_request_status(self.request_url)
>+            state = response['state']
>+            if state == 'ready':
>+                return
>+            self.log("Waiting for request 'ready' stage.  Current state: '%s'" % state)

self.info()?
Attachment #687985 - Flags: review?(aki) → review+
Comment on attachment 687985 [details] [diff] [review]
[mozharness] updated patch

Landed in http://hg.mozilla.org/build/mozharness/rev/da5349059059
Attachment #687985 - Flags: checked-in+
Keeping this bug open as we will need some more client-side changes in response to server-side updates in bug 817035.
Depends on: 817035
Attachment #689285 - Flags: review?(dustin) → review+
Comment on attachment 689285 [details] [diff] [review]
[mozharness] add 'environment' to request_device()

Landed in http://hg.mozilla.org/build/mozharness/rev/c3e91b76976e
Attachment #689285 - Flags: checked-in+
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering
Component: General Automation → Mozharness