Closed Bug 1182532 Opened 9 years ago Closed 8 years ago

archiver is broken - mozharness can't be deployed on slaves

Categories

(Release Engineering :: General, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jlund, Assigned: jlund)

References

Details

Attachments

(4 files)

related:

09:03 <vinh> hi releng folks we're getting alerted with:  rabbit1.releng.webapp.scl3.mozilla.com:Rabbit Unread Messages is CRITICAL: RABBITMQ_OVERVIEW CRITICAL - messages CRITICAL (735) messages_ready CRITICAL (655), messages_unacknowledged OK (80) (http://m.mozilla.org/Rabbit+Unread+Messages)
09:30 <vinh> catlee: should I purge those messages in /relengapi-celery

09:14 <RyanVM|sheriffduty> big pile o' infra exceptions https://treeherder.mozilla.org/logviewer.html#?job_id=945580&repo=mozilla-aurora

looks like archiver won't complete any calls that use relengapi's celery as a background task. currently looking for initial error that wedged us.
Severity: normal → blocker
dustin restarted celeryd ~10:10 PT

    10:07 <dustin> restarting celeryd
    10:12 <dustin> message count is dropping
    10:12 <dustin> so something in celery got wedged :/
11:27 <dustin> jlund: btw, I increased teh concurrency of the celery worker to 40 (from 20)
12:07 <jlund> dustin: rabbitmq message count: 11:53 <vinh> jlund: 2,572 msg   alert still hasn't cleared
12:08 <dustin> yeah, they're all in those response queues
12:09 <dustin> something's still broken
12:09 <dustin> when you trigger the celery jobs, there's some flag you set to say "I'm not waiting for the answer"
12:09 <dustin> are you setting that?
12:09 <dustin> why would releases_mozillabetabd50a25b0ff3.tar.gz_testing_mozharness have 634 *anything*?
12:10 <dustin> are there really that many requests for that single tarball?
12:10 <dustin> all within the time it takes to download one copy and upload to S3?
12:10 <dustin> something's really not right here..
12:11 <jlund> dustin: with retries and if tasks were wedged, maybe the count could be that high.
12:12 <jlund> dustin: I wonder if this logic is the culprit: if create_and_upload_archive.AsyncResult(task_id).state != 'STARTED':
12:13 <dustin> maybe
12:13 <dustin> I have no idea how celery gets that state
12:13 <dustin> I *assume* it's a dynamic request to the celery nodes
12:14 <dustin> since there's no other place to store state
12:14 <dustin> those queues are filled from the 'celeryresults' exchange
12:18 <dustin> jlund: ah, yeah, I bet STARTED is wrong, because I bet there's some pre-started state
12:18 <dustin> in which you don't want to call apply_async again
12:19 <jlund> dustin: that's what I'm thinking
12:19 <dustin> but also, apply_async -- I think -- waits for a response
12:19 <dustin> which is probably where those never-read messages are coming from
12:20 <dustin> hm, callek does the same for slaveloan
12:20 <dustin> 'course that's not in production so maybe it's not right
12:20 <jlund> so just quickly coming back to this code, I suppose if the state was "PENDING" that could be 'a task doesn't exist by this id" or "a task exists, it just hasn't been picked up"
12:20 <dustin> I don't know
12:20 <dustin> I don't think Celery could distinguish those two cases
12:21 <jlund> if so, we are probably spinning up tasks with the same task id
12:21 <dustin> https://zapier.com/blog/async-celery-example-why-and-how/ suggests that calling AsyncResult(task_id) starts the task
12:21 <dustin> yeah
12:22 <dustin> I think that's one problem
12:22 <dustin> but I also think you're not collecting the results of any tasks
12:22 <jlund> well that's what my task_status(task_id) is doing
12:23 <dustin> http://celery.readthedocs.org/en/latest/userguide/tasks.html#ignore-results-you-don-t-want
12:23 <dustin> true
12:24 <dustin> so either that's not effective
12:24 <dustin> or 634 times for mozilla_release-beta, the job requested an archive, then stopped polling before the archive task was done
12:25 <jlund> sigh. I guess I have to roll this back
12:25 <dustin> Celery might have some timeout mechanism to clean up un-gathered responses -- this has to be a common issue
12:26 <dustin> it's possible that's a 1-day timeout or something, so we just didn't hit it here
12:26 <dustin> it might be worth pinging w.lach on monday to see if he has some more context
12:26 <dustin> I wish I knew more about this stuff, sorry
12:27 <dustin> I'll delete the biggest queues now
12:27 <jlund> dustin: no worries. I'll roll this all back and reconfig. I think using sql might be better for tracking state
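
For context, the fragile check being discussed in the chat above boils down to something like this (a simplified sketch, not the code as deployed):

    # Simplified sketch of the check discussed above, not the deployed archiver code.
    from celery import Celery

    app = Celery('archiver_sketch', broker='amqp://localhost//', backend='amqp')

    @app.task
    def create_and_upload_archive(src_url, key):
        pass  # download from hg.m.o and upload to S3 (omitted)

    def ensure_task(src_url, key):
        task_id = key.replace('/', '_')
        state = create_and_upload_archive.AsyncResult(task_id).state
        # Celery reports PENDING both for "queued but not yet picked up by a
        # worker" and for "no task with this id has ever been submitted", so
        # this branch can re-submit a task that is already sitting in rabbit.
        if state != 'STARTED':
            create_and_upload_archive.apply_async(args=[src_url, key], task_id=task_id)
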
this is backed out now. trees should be using mozharness.json again. trees are re-opened
Assignee: nobody → jlund
Severity: blocker → normal
Status: NEW → ASSIGNED
Blocks: 1154795
# from PR

This guards against creating multiple celery tasks for the same job and manages cleanup. It does this by:

    setting a time limit on how long a task can run (protecting against hung tasks)
    setting an expiry on how long a task can sit un-actioned in the message queue
    tracking the state of celery tasks with a db backend. Prior to this I only tracked whether the state given by celery was 'STARTED', but celery tasks can also be PENDING or RETRY, and you can't trust PENDING since that's also the default state for tasks that were never even created.
    explicitly capping the revision at 12 chars.
        If a buildbot job or user hits the archiver api with both the short hash and the long hash, archiver used to treat these as two separate archives. This could be an issue if a bunch of buildbot jobs have the short hash stored in buildbot properties.
    some housekeeping with the trackers in the db. If a task goes past its time limit or is in a finished state while its tracker still exists in the db, a badpenny job will clean it up.
    adding more logging to help track the health of archiver

Though it's much more work than the old check (if create_and_upload_archive.AsyncResult(task_id).state != 'STARTED':), managing the tasks through a tracker should give us much more control.
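
To make those moving pieces concrete, here's a rough sketch of the approach (illustrative only; the names and values here are not the actual PR 296 code):

    # Rough sketch only; the real implementation is in PR 296.
    from celery import Celery

    app = Celery('archiver_sketch', broker='amqp://localhost//',
                 backend='db+mysql://user:pass@dbhost/relengapi')  # db-backed results

    TASK_TIME_LIMIT = 10 * 60  # hard-kill a task that runs longer than this
    PENDING_EXPIRES = 60 * 60  # drop a queued message left un-actioned this long

    @app.task(time_limit=TASK_TIME_LIMIT)
    def create_and_upload_archive(src_url, key):
        pass  # download from hg.m.o, repack, upload to S3 (omitted)

    def request_archive(src_url, rev, subdir):
        rev = rev[:12]  # short and long hashes map to the same archive/task id
        key = '{}.tar.gz/{}'.format(rev, subdir)
        task_id = key.replace('/', '_')  # no slashes in the task/status url
        # a tracker row keyed on task_id is recorded in the db (omitted here);
        # a badpenny job later removes trackers for finished or over-time tasks
        return create_and_upload_archive.apply_async(
            args=[src_url, key], task_id=task_id, expires=PENDING_EXPIRES)
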
Attachment #8633804 - Flags: review?(dustin)
Comment on attachment 8633804 [details] [review]
https://github.com/mozilla/build-relengapi/pull/296

Looks good, but some comments to address in the PR
Attachment #8633804 - Flags: review?(dustin) → feedback+
Attachment #8633804 - Flags: review?(dustin) → review+
relengapi staging and prod have been updated to 3.1.2 with https://github.com/mozilla/build-relengapi/pull/296 in it
Depends on: 1185507
once this is resolved, we should reverse the patch that added a fallback to hgmo in https://bugzil.la/1181331

the patch:
https://bugzilla.mozilla.org/attachment.cgi?id=8635625
we have fixed up a lot of errors that caused perma-bad csets here: https://bugzilla.mozilla.org/show_bug.cgi?id=1185507#c6

however, I came across a recent job that was still timing out. Seems like there is still an issue.

e.g.
treeherder: https://treeherder.mozilla.org/#/jobs?repo=try&revision=56272fbd3fb5
rev: 56272fbd3fb5

the first build to poll archiver was this one: http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/nfitzgerald@mozilla.com-56272fbd3fb5/try-linux64-asan-debug/try-linux64-asan-debug-bm75-try1-build856.txt.gz

you can see it realizes the s3 url doesn't exist so it creates a celery task. However it just stays pending and times out:
2015-07-24 10:59:02,183 Setting DEBUG logging.
2015-07-24 10:59:02,183 attempt 1/10
2015-07-24 10:59:02,183 Getting archive location from https://api.pub.build.mozilla.org/archiver/hgmo/try/56272fbd3fb5?&preferred_region=us-west-2&suffix=tar.gz&subdir=testing/mozharness
2015-07-24 10:59:03,528 attempt 1/10
2015-07-24 10:59:04,123 current task status: no status available at this point. state: PENDING
2015-07-24 10:59:04,123 sleeping for 10.00s (attempt 1/10)

all of the subsequent builds of this changeset behaved slightly different (client logs): http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/nfitzgerald@mozilla.com-56272fbd3fb5/try-android-api-11/try-android-api-11-bm76-try1-build1397.txt.gz

bash -c 'python /tools/checkouts/build-tools/buildfarm/utils/archiver_client.py mozharness --repo try --rev 56272fbd3fb5a400eb778bda1f3cc208d79becc0 --destination scripts --debug'
 in dir /builds/slave/try-l64-0000000000000000000000/. (timeout 1200 secs)
2015-07-24 11:03:16,608 truncating revision to first 12 chars
2015-07-24 11:03:16,608 Setting DEBUG logging.
2015-07-24 11:03:16,608 attempt 1/10
2015-07-24 11:03:16,608 Getting archive location from https://api.pub.build.mozilla.org/archiver/hgmo/try/56272fbd3fb5?&preferred_region=us-west-2&suffix=tar.gz&subdir=testing/mozharness
2015-07-24 11:03:17,530 sleeping for 11.00s (attempt 1/10)
2015-07-24 11:03:28,541 attempt 2/10
...etc.
2015-07-24 11:10:58,476 Getting archive location from https://api.pub.build.mozilla.org/archiver/hgmo/try/56272fbd3fb5?&preferred_region=us-west-2&suffix=tar.gz&subdir=testing/mozharness
2015-07-24 11:10:59,360 sleeping for 99.00s (attempt 9/10)
2015-07-24 11:12:38,459 attempt 10/10
2015-07-24 11:12:38,459 Getting archive location from https://api.pub.build.mozilla.org/archiver/hgmo/try/56272fbd3fb5?&preferred_region=us-west-2&suffix=tar.gz&subdir=testing/mozharness
2015-07-24 11:12:39,366 Could not get a valid response from archiver.
Traceback (most recent call last):
  File "/tools/checkouts/build-tools/buildfarm/utils/archiver_client.py", line 153, in get_url_response
    response = urllib2.urlopen(api_url, timeout=60)
  File "/usr/lib64/python2.6/urllib2.py", line 126, in urlopen
    return _opener.open(url, data, timeout)
  File "/usr/lib64/python2.6/urllib2.py", line 397, in open
    response = meth(req, response)
  File "/usr/lib64/python2.6/urllib2.py", line 510, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib64/python2.6/urllib2.py", line 429, in error
    result = self._call_chain(*args)
  File "/usr/lib64/python2.6/urllib2.py", line 369, in _call_chain
    result = func(*args)
  File "/usr/lib64/python2.6/urllib2.py", line 616, in http_error_302
    return self.parent.open(new, timeout=req.timeout)
  File "/usr/lib64/python2.6/urllib2.py", line 397, in open
    response = meth(req, response)
  File "/usr/lib64/python2.6/urllib2.py", line 510, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib64/python2.6/urllib2.py", line 435, in error
    return self._call_chain(*args)
  File "/usr/lib64/python2.6/urllib2.py", line 369, in _call_chain
    result = func(*args)
  File "/usr/lib64/python2.6/urllib2.py", line 518, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
HTTPError: HTTP Error 404: Not Found
2015-07-24 11:12:39,368 attempt 1/3
2015-07-24 11:12:39,368 Trying src_url directly: https://hg.mozilla.org/try/archive/56272fbd3fb5.tar.gz/testing/mozharness
2015-07-24 11:12:41,816 unpacking tar archive at: try-56272fbd3fb5/testing/mozharness/
program finished with exit code 0

Notice above: archiver tries to return an s3 url, but that url doesn't exist, so we raise a 404 every time the client polls 'https://api.pub.build.mozilla.org/archiver/hgmo/try/56272fbd3fb5?&preferred_region=us-west-2&suffix=tar.gz&subdir=testing/mozharness'

client context: here (http://mxr.mozilla.org/build/source/tools/buildfarm/utils/archiver_client.py#153) raises line 164

relengapi context: https://github.com/mozilla/build-relengapi/blob/master/relengapi/blueprints/archiver/__init__.py#L186 

meanwhile in relengapi logs:
/relengapi.log.18:2015-07-24 10:59:03,463 [relengapi.blueprints.archiver] Creating new celery task and task tracker for: try-56272fbd3fb5.tar.gz_testing_mozharness
./relengapi.log.18:2015-07-24 10:59:03,774 [relengapi.blueprints.archiver] generating GET URL to try-56272fbd3fb5.tar.gz/testing/mozharness, expires in 300s
./relengapi.log.18:2015-07-24 10:59:14,527 [relengapi.blueprints.archiver] checking status of task id try-56272fbd3fb5.tar.gz_testing_mozharness: current state PENDING
./relengapi.log.18:2015-07-24 10:59:15,267 [relengapi.blueprints.archiver] generating GET URL to try-56272fbd3fb5.tar.gz/testing/mozharness, expires in 300s
./relengapi.log.18:2015-07-24 10:59:15,668 [relengapi.blueprints.archiver] generating GET URL to try-56272fbd3fb5.tar.gz/testing/mozharness, expires in 300s
./relengapi.log.18:2015-07-24 10:59:17,118 [relengapi.blueprints.archiver] generating GET URL to try-56272fbd3fb5.tar.gz/testing/mozharness, expires in 300s
./relengapi.log.18:2015-07-24 10:59:31,145 [relengapi.blueprints.archiver] checking status of task id try-56272fbd3fb5.tar.gz_testing_mozharness: current state PENDING
./relengapi.log.18:2015-07-24 10:59:33,055 [relengapi.blueprints.archiver] generating GET URL to try-56272fbd3fb5.tar.gz/testing/mozharness, expires in 300s
.... # these get urls are repeated dozens of times
./relengapi.log.18:2015-07-24 11:00:35,588 [relengapi.blueprints.archiver] generating GET URL to try-56272fbd3fb5.tar.gz/testing/mozharness, expires in 300s
./relengapi.log.18:2015-07-24 11:00:35,893 [relengapi.blueprints.archiver] checking status of task id try-56272fbd3fb5.tar.gz_testing_mozharness: current state PENDING

Notice there are no errors above. We log the intent to create the task and tracker, then at least one host polls the status url location (the first log line above) and it reports as PENDING. The other requests are handed generated signed urls for the s3 location, but the s3 object doesn't actually exist (those are the requests that received 404s).


Meanwhile, in the celery logs we can see that the celery task is never even created/started initially at 11:00 PT, as the first mention of 56272fbd3fb5 is at 12:18 PT:
[2015-07-24 12:18:04,675: INFO/MainProcess] Received task: relengapi.blueprints.badpenny.execution._run_job[b40ca441-1d8d-4f1e-9bdc-0094a7805d75]
[2015-07-24 12:18:04,705: INFO/Worker-11] Running task u'relengapi.blueprints.archiver.cleanup_old_tasks' id 20372
[2015-07-24 12:18:04,744: INFO/Worker-11] deleting tracker with id: try-56272fbd3fb5.tar.gz_testing_mozharness
[2015-07-24 12:18:04,773: INFO/MainProcess] Task relengapi.blueprints.badpenny.execution._run_job[b40ca441-1d8d-4f1e-9bdc-0094a7805d75] succeeded in 0.0952629260719s: None
[2015-07-24 12:18:56,658: INFO/MainProcess] Received task: relengapi.blueprints.archiver.tasks.create_and_upload_archive[try-56272fbd3fb5.tar.gz_testing_mozharness] expires:[2015-07-24 19:48:56.592159+00:00]
[2015-07-24 12:18:56,702: INFO/Worker-17] Key to be uploaded to S3: try-56272fbd3fb5.tar.gz/testing/mozharness - Verifying src_url: https://hg.mozilla.org/try/archive/56272fbd3fb5.tar.gz/testing/mozharness
[2015-07-24 12:18:58,865: INFO/Worker-17] Key to be uploaded to S3: try-56272fbd3fb5.tar.gz/testing/mozharness - downloading and unpacking archive from src_url
[2015-07-24 12:19:02,595: INFO/MainProcess] Task relengapi.blueprints.archiver.tasks.create_and_upload_archive[try-56272fbd3fb5.tar.gz_testing_mozharness] succeeded in 5.93358509988s: {'status': "Task completed! Check 's3_urls' for upload locations.", 's3_urls': {'us-west-2':...


The first time 56272fbd3fb5 is logged is when we clean up the tracker over an hour later, which proves that the tracker at least was added to the db and that, seemingly, a task was created: https://github.com/mozilla/build-relengapi/blob/master/relengapi/blueprints/archiver/__init__.py#L177

However, it was only after badpenny rm'd the tracker that we created a new celery task and actually uploaded to s3 successfully:

Back in the relengapi logs, later on after badpenny cleaned up the tracker, everything works fine again:

/relengapi.log.9:2015-07-24 12:17:15,184 [relengapi.blueprints.archiver] generating GET URL to try-56272fbd3fb5.tar.gz/testing/mozharness, expires in 300s
./relengapi.log.9:2015-07-24 12:18:56,591 [relengapi.blueprints.archiver] Creating new celery task and task tracker for: try-56272fbd3fb5.tar.gz_testing_mozharness
./relengapi.log.9:2015-07-24 12:18:56,863 [relengapi.blueprints.archiver] checking status of task id try-56272fbd3fb5.tar.gz_testing_mozharness: current state STARTED
./relengapi.log.9:2015-07-24 12:19:08,269 [relengapi.blueprints.archiver] checking status of task id try-56272fbd3fb5.tar.gz_testing_mozharness: current state SUCCESS
./relengapi.log.9:2015-07-24 12:19:08,270 [relengapi.blueprints.archiver] deleting tracker with id: try-56272fbd3fb5.tar.gz_testing_mozharness


my conclusions:

1) on the first few builds at 11:00 PT we try to create a tracker and task, but subsequent jobs are evaluating False for one of these two conditions:
    - https://github.com/mozilla/build-relengapi/blob/master/relengapi/blueprints/archiver/__init__.py#L161
    - https://github.com/mozilla/build-relengapi/blob/master/relengapi/blueprints/archiver/__init__.py#L173
2) none of the first few builds at 11:00 PT ever seems to have a celery task that starts; otherwise we would see log lines like the following around 11:00 PT in celery1:
    - https://github.com/mozilla/build-relengapi/blob/master/relengapi/blueprints/archiver/tasks.py#L26 
3) after an hour, we clean up the tracker (and celery/rabbit would remove any expired tasks), and somehow relengapi stops reporting that the s3 url exists so we create another new task here (which completes fine this time)
4) one obvious bug is that we shouldn't be able to make it past https://github.com/mozilla/build-relengapi/blob/master/relengapi/blueprints/archiver/__init__.py#L173 without returning an error. However, if badpenny deletes a tracker that expired at 12:18, how are we skipping L173 in the first few builds?

</ braindump>
> 4) one obvious bug is that we shouldn't be able to make it past
> https://github.com/mozilla/build-relengapi/blob/master/relengapi/blueprints/
> archiver/__init__.py#L173 without returning an error. However, if badpenny
> deletes a tracker that expired at 12:18, how are we skipping L173 in the
> first few builds?

I read L173 backwards. We do have a tracker, so we should be returning a 202 with the Location pointing at where the task status can be polled. Which means I should be doing:

diff --git a/relengapi/blueprints/archiver/__init__.py b/relengapi/blueprints/archiver/__init__.py
index c50a645..9f3933d 100644
--- a/relengapi/blueprints/archiver/__init__.py
+++ b/relengapi/blueprints/archiver/__init__.py
@@ -160,31 +160,31 @@ def get_archive(src_url, key, preferred_region):
     # first, see if the key exists
     if not s3.get_bucket(bucket).get_key(key):
         task_id = key.replace('/', '_')  # keep things simple and avoid slashes in task url
         # can't use unique support:
         # api.pub.build.mozilla.org/docs/development/databases/#unique-row-support-get-or-create
         # because we want to know when the row doesn't exist before creating it
         tracker = tables.ArchiverTask.query.filter(tables.ArchiverTask.task_id == task_id).first()
         if tracker and tracker.state in FINISHED_STATES:
             log.info('Task tracker: {} exists but finished with state: '
                      '{}'.format(task_id, tracker.state))
             # remove tracker and try celery task again
             delete_tracker(tracker)
             tracker = None
         if not tracker:
             log.info("Creating new celery task and task tracker for: {}".format(task_id))
             task = create_and_upload_archive.apply_async(args=[src_url, key], task_id=task_id)
             if task and task.id:
                 session.add(tables.ArchiverTask(task_id=task.id, created_at=now(),
                                                 src_url=src_url, state="PENDING"))
                 session.commit()
-                return {}, 202, {'Location': url_for('archiver.task_status', task_id=task.id)}
             else:
                 return {}, 500
+        return {}, 202, {'Location': url_for('archiver.task_status', task_id=task.id)}

     log.info("generating GET URL to {}, expires in {}s".format(key, GET_EXPIRES_IN))
     # return 302 pointing to s3 url with archive
     signed_url = s3.generate_url(
         method='GET', expires_in=GET_EXPIRES_IN,
         bucket=bucket, key=key
     )
     return redirect(signed_url)
so now the question is, why is the original archiver call here:

http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/nfitzgerald@mozilla.com-56272fbd3fb5/try-linux64-asan-debug/try-linux64-asan-debug-bm75-try1-build856.txt.gz

/relengapi.log.18:2015-07-24 10:59:03,463 [relengapi.blueprints.archiver] Creating new celery task and task tracker for: try-56272fbd3fb5.tar.gz_testing_mozharness

somehow able to create a tracker row if the celery task never actually starts: https://github.com/mozilla/build-relengapi/blob/master/relengapi/blueprints/archiver/__init__.py#L176

I bet rabbit logs at 11:00 PT will give more insight for rev: 56272fbd3fb5 but I don't have access to the rabbit web view.
One solution, if this just means that creating celery tasks is not 100% reliable: simply say a celery task for mozharness should only take 2 min. If the task is not completed by then, create a new task.

We are already cleaning up celery tasks that are hung (time_limit) or still in rabbit queue (expires) so it shouldn't be that bad to create another one here and there.
r+ to the diff in comment 12.

There's not a lot to love in the logs:

=INFO REPORT==== 24-Jul-2015::10:58:05 ===
accepting AMQP connection <0.20123.3706> (10.22.81.209:16866 -> 10.22.81.92:5672)

=INFO REPORT==== 24-Jul-2015::10:59:03 ===
accepting AMQP connection <0.20766.3706> (10.22.81.209:24111 -> 10.22.81.92:5672)

=WARNING REPORT==== 24-Jul-2015::10:59:19 ===
closing AMQP connection <0.17630.3706> (10.22.81.209:50454 -> 10.22.81.92:5672):
connection_closed_abruptly

=INFO REPORT==== 24-Jul-2015::10:59:32 ===
accepting AMQP connection <0.21068.3706> (10.22.81.209:29109 -> 10.22.81.92:5672)

=INFO REPORT==== 24-Jul-2015::10:59:32 ===
accepting AMQP connection <0.21074.3706> (10.22.81.209:29112 -> 10.22.81.92:5672)

=WARNING REPORT==== 24-Jul-2015::10:59:37 ===
closing AMQP connection <0.17817.3706> (10.22.81.209:52735 -> 10.22.81.92:5672):
connection_closed_abruptly

=INFO REPORT==== 24-Jul-2015::11:00:15 ===
accepting AMQP connection <0.21553.3706> (10.22.81.209:35374 -> 10.22.81.92:5672)

=WARNING REPORT==== 24-Jul-2015::11:00:58 ===
closing AMQP connection <0.18698.3706> (10.22.81.209:63655 -> 10.22.81.92:5672):
connection_closed_abruptly

=INFO REPORT==== 24-Jul-2015::11:01:09 ===
accepting AMQP connection <0.22129.3706> (10.22.81.209:42948 -> 10.22.81.92:5672)

=WARNING REPORT==== 24-Jul-2015::11:01:49 ===
closing AMQP connection <0.19212.3706> (10.22.81.209:5362 -> 10.22.81.92:5672):
connection_closed_abruptly

=INFO REPORT==== 24-Jul-2015::11:03:07 ===
accepting AMQP connection <0.23395.3706> (10.22.81.209:58479 -> 10.22.81.92:5672)

... and that connection closed quite a bit later ...

=WARNING REPORT==== 24-Jul-2015::11:48:55 ===
closing AMQP connection <0.20766.3706> (10.22.81.209:24111 -> 10.22.81.92:5672):
connection_closed_abruptly

With the change away from tracking results as queues, rabbit itself shows very little of interest in the web view.  The `celery` queue is empty, so the message isn't stuck in there somewhere.  However, looking in the DB:

mysql> select * from celery_taskmeta where task_id like '%56272fbd3fb5%' \G
*************************** 1. row ***************************
       id: 46
  task_id: try-56272fbd3fb5.tar.gz_testing_mozharness
   status: SUCCESS
   result: �}q(UstatusqU5Task completed! Check 's3_urls' for upload locations.qUs3_urlsq}q(U    us-west-2qU�https://mozilla-releng-usw2-archiver.s3-us-west-2.amazonaws.com/try-56272fbd3fb5.tar.gz/testing/mozharness?Signature=59441kNImqdKgWXeJFRCZVir8oo%3D&Expires=1437765841&AWSAccessKeyId=AKIAILEHYUYSBPB5N3DQU        us-east-1qU�https://mozilla-releng-use1-archiver.s3.amazonaws.com/try-56272fbd3fb5.tar.gz/testing/mozharness?Signature=LC%2FUiw8bqaCP%2Fa2nBx5Pbc15E8k%3D&Expires=1437765842&AWSAccessKeyId=AKIAILEHYUYSBPB5N3DQuUsrc_urlXI   https://hg.mozilla.org/try/archive/56272fbd3fb5.tar.gz/testing/mozharnessq      u.
date_done: 2015-07-24 19:19:03
traceback: NULL

that date_done is UTC, equivalent to 12:19:03 PT.  So that represents the second run.  The closest thing time-wise to the first run is

mysql> select * from celery_taskmeta where date_done = '2015-07-24 18:06:04' \G
*************************** 1. row ***************************
       id: 19
  task_id: try-52d47fd24448.tar.gz_testing_mozharness
   status: SUCCESS
   result: �}q(UstatusqU5Task completed! Check 's3_urls' for upload locations.qUs3_urlsq}q(U    us-west-2qU�https://mozilla-releng-usw2-archiver.s3-us-west-2.amazonaws.com/try-52d47fd24448.tar.gz/testing/mozharness?Signature=jb5Ilp5PH4sKSmLxPe1fkmzqBWE%3D&Expires=1437761462&AWSAccessKeyId=AKIAILEHYUYSBPB5N3DQU        us-east-1qU�https://mozilla-releng-use1-archiver.s3.amazonaws.com/try-52d47fd24448.tar.gz/testing/mozharness?Signature=4kbcG%2ByP%2F2tflpTWlz2OtTsZsh8%3D&Expires=1437761463&AWSAccessKeyId=AKIAILEHYUYSBPB5N3DQuUsrc_urlXI   https://hg.mozilla.org/try/archive/52d47fd24448.tar.gz/testing/mozharnessq      u.
date_done: 2015-07-24 18:06:04
traceback: NULL
thanks for poking at this

(In reply to Dustin J. Mitchell [:dustin] from comment #15)
> r+ to the diff in comment 12.

sweet!

> 
> =INFO REPORT==== 24-Jul-2015::11:03:07 ===
> accepting AMQP connection <0.23395.3706> (10.22.81.209:58479 ->
> 10.22.81.92:5672)
> 
> ... and that connection closed quite a bit later ...
> 
> =WARNING REPORT==== 24-Jul-2015::11:48:55 ===
> closing AMQP connection <0.20766.3706> (10.22.81.209:24111 ->
> 10.22.81.92:5672):
> connection_closed_abruptly

so it sounds like, the way things are configured, there is a reliability issue. Some examples I found of intermittent perma-PENDING:
   - https://github.com/celery/celery/issues/1042 - not enough disk space on the rabbit hosts
   - https://groups.google.com/forum/#!topic/celery-users/Xt9mjO7ZpQc - having to set unusual celery configuration

> 
> The `celery` queue is empty, so
> the message isn't stuck in there somewhere.

hrm, I wonder if it never made it or maybe was cleaned up after it expired.

> However, looking in the DB:
> 
So if the db is only tracking task results, and the first celery task never started, there would never be a record in the db anyway, right?

Unless you stop me, I'm going to pursue a change where we re-create the task if it is still PENDING after some amount of time (say 60s). If it has STARTED or is in the middle of a RETRY, keep the task going. This doesn't address the occasional unreliability between relengapi <-> archiver <-> celery <-> rabbit, but it would remove our reliance on falling back to hgmo when we get stuck like this.
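
Roughly what I have in mind (a sketch only; the tracker fields and helper names here are made up, not the actual patch):

    # Sketch of the pending-expiry idea; tracker fields and create_task()
    # are hypothetical stand-ins for the real archiver code.
    from datetime import datetime, timedelta

    PENDING_EXPIRES_IN = timedelta(seconds=60)

    def create_task(task_id, src_url, key):
        # stand-in for create_and_upload_archive.apply_async(...)
        print('(re)creating celery task %s' % task_id)

    def check_pending(tracker, celery_state, src_url, key):
        now = datetime.utcnow()
        if celery_state == 'PENDING' and now > tracker['pending_expires_at']:
            # the task never left PENDING in time: assume it was lost somewhere
            # between relengapi, rabbit, and the celery worker, and submit a
            # replacement under the same task id
            create_task(tracker['task_id'], src_url, key)
            tracker['pending_expires_at'] = now + PENDING_EXPIRES_IN
        # STARTED and RETRY are left alone; finished states are cleaned up by
        # the badpenny cleanup job
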
camd,

We're seeing some instability with celery (specifically, tasks not starting) among other issues. We switched to a DB-based results backend instead of a rabbit/amqp-based one.

I'm n-i'ing you to see if you saw anything similar with treeherder, and/or for some insight into what you guys may be doing differently with that so we can try and get our end fixed.

Any ideas welcome. (I'd rather fix the cause not the symptom)
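
(For reference, the results-backend switch amounts to a config change along these lines; the DSN is a placeholder, not our real one:)

    # before: results flowed back through rabbit as per-caller reply queues
    # CELERY_RESULT_BACKEND = 'amqp'
    # after: results are rows in SQL (the celery_taskmeta table dustin queried above)
    CELERY_RESULT_BACKEND = 'db+mysql://relengapi:********@dbhost/relengapi'  # placeholder DSN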
Flags: needinfo?(cdawson)
I know this is all overkill. Having a results tracker for celery tasks means something is likely broken or that celery is not the right tool for my needs. But I am hoping the added complexity can at least unblock us from our reliance on falling back to hgmo when archiver celery tasks get into a perma-pending state.
Attachment #8639393 - Flags: review?(dustin)
Comment on attachment 8639393 [details] [review]
adds pending expiry time to archiver celery tracker

I'm not sure the logic is correct (maybe I'm just missing something)
Attachment #8639393 - Flags: review?(dustin)
Comment on attachment 8639393 [details] [review]
adds pending expiry time to archiver celery tracker

I get it now :)
Attachment #8639393 - Flags: review+
the band-aid seems to have helped:

you can see here we have a task stuck pending

after a min, we delete the tracker and trigger another celery task:
2015-07-28 15:05:33,111 [relengapi.blueprints.archiver] Creating new celery task and task tracker for: try-a25b10547463.tar.gz_testing_mozharness
2015-07-28 15:05:33,580 [relengapi.blueprints.archiver] checking status of task id try-a25b10547463.tar.gz_testing_mozharness: current state PENDING

2015-07-28 15:06:35,056 [relengapi.blueprints.archiver] checking status of task id try-a25b10547463.tar.gz_testing_mozharness: current state PENDING
2015-07-28 15:06:35,211 [relengapi.blueprints.archiver] Task try-a25b10547463.tar.gz_testing_mozharness has expired from pending too long. Re-creating task
2015-07-28 15:06:35,212 [relengapi.blueprints.archiver] renewing tracker 3435 with pending expiry: 2015-07-28 22:07:35.212706+00:00
2015-07-28 15:06:53,778 [relengapi.blueprints.archiver] checking status of task id try-a25b10547463.tar.gz_testing_mozharness: current state SUCCESS

from the db side: https://pastebin.mozilla.org/8840887

so I think we are good here. *However*, this happens way too often. Not sure on exact numbers yet, but 1 out of every dozen or few dozen celery tasks never actually leaves the PENDING state.

It would be great to find out why...
(In reply to Justin Wood (:Callek) from comment #18)
> camd,
> 
> We're seeing some instability with celery (specifically tasks not starting)
> among other issues. We switched to a DB-based results backend instead of a
> rabbit/amqp based.
> 
> I'm n-i'ing you to see if you saw anything similar with treeherder, and/or
> for some insight into what you guys may be doing differently with that so we
> can try and get our end fixed.
> 
> Any ideas welcome. (I'd rather fix the cause not the symptom)

callek-- We have had stability issues with Celery at times. Most of ours were due to restarting during deployments; it sometimes leaves zombie processes running.

Early on, we started using the ``maxtasksperchild`` setting: http://celery.readthedocs.org/en/latest/userguide/workers.html#max-tasks-per-child-setting.  This helped some of our issues and seemed to provide better longevity.  There is a memory leak somewhere in our data ingestion, but we were never able to track it down.
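
For example (the number is arbitrary, not necessarily what treeherder runs with):

    # recycle each worker process after this many tasks so slow leaks can't accumulate
    CELERYD_MAX_TASKS_PER_CHILD = 100  # arbitrary value; tune per workload
    # or on the worker command line: celery worker --maxtasksperchild=100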

I can't think of anything more than that.  Though I admit I don't grok everything in this bug.  It's a biggie... :)  Looping in mdoglio in case he has any more info.
Flags: needinfo?(cdawson) → needinfo?(mdoglio)
We are ready to try removing the need to fall back on hgmo, thanks to https://bugzilla.mozilla.org/show_bug.cgi?id=1185507#c6 and https://bugzilla.mozilla.org/show_bug.cgi?id=1182532#c22

rail, this reverses the patch I had you review here: https://bug1181331.bmoattachments.org/attachment.cgi?id=8635625

It also lets any exceptions raised on retry attempts be logged after each attempt, so we know earlier in the log if something is wrong.
Attachment #8640576 - Flags: review?(rail)
Attachment #8640576 - Flags: review?(rail) → review+
Comment on attachment 8640576 [details] [diff] [review]
150729_1182532_archiver_intermittent_pending-rms_dep_on_hgmo_fallback-tools.patch

ta

https://hg.mozilla.org/build/tools/rev/301ea12efbad
Attachment #8640576 - Flags: checked-in+
I have never experienced a situation where a task is not picked up by a worker. Do you have a task monitor showing the list of tasks stuck in pending state?
Flags: needinfo?(mdoglio)
I have something similar in bug 1190405. Signing workers don't use celery, but they use kombu, which is used by celery. Explicitly setting the heartbeat interval helped there. You can try setting http://docs.celeryproject.org/en/latest/reference/celery.bin.worker.html#cmdoption-celery-worker--heartbeat-interval
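
For example (values here are arbitrary; tune as needed):

    # AMQP-level heartbeats let kombu notice a dead broker connection instead
    # of silently holding on to a broken socket
    BROKER_HEARTBEAT = 10            # seconds between AMQP heartbeats
    BROKER_HEARTBEAT_CHECKRATE = 2   # check the heartbeat twice per interval
    # worker side, per the docs linked above: celery worker --heartbeat-interval=5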
The heartbeat setting is being addressed in bug 1190408.

Mauro, thanks for the info.  What do you use for monitoring?  I've been using the celery command (wrapped to set up the environment):

[root@celery1.srv.releng.scl3 relengapi]# ./run-celery.sh inspect active
-> celery@celery1.srv.releng.scl3.mozilla.com: OK
    - empty -

but it has never really given me useful data.
sorry wrong bug.
(:dgarvey is referring to a comment he deleted)
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Component: General Automation → General