Bug 1182532 (Closed) - archiver is broken - mozharness can't be deployed on slaves
Opened 9 years ago • Closed 8 years ago

Categories: Release Engineering :: General (defect)
Tracking: Not tracked
Status: RESOLVED FIXED
People: Reporter: jlund; Assignee: jlund
Attachments: 4 files
related:

09:03 <vinh> hi releng folks we're getting alerted with: rabbit1.releng.webapp.scl3.mozilla.com:Rabbit Unread Messages is CRITICAL: RABBITMQ_OVERVIEW CRITICAL - messages CRITICAL (735) messages_ready CRITICAL (655), messages_unacknowledged OK (80) (http://m.mozilla.org/Rabbit+Unread+Messages)
09:30 <vinh> catlee: should I purge those messages in /relengapi-celery
09:14 <RyanVM|sheriffduty> big pile o' infra exceptions https://treeherder.mozilla.org/logviewer.html#?job_id=945580&repo=mozilla-aurora

looks like archiver won't complete any calls that use relengapi's celery as a background task. currently looking for the initial error that wedged us.
Updated • 9 years ago
Severity: normal → blocker
Comment 2 • 9 years ago (Assignee)
dustin restarted celeryd ~10:10 PT

10:07 <dustin> restarting celeryd
10:12 <dustin> message count is dropping
10:12 <dustin> so something in celery got wedged :/
Comment 3 • 9 years ago (Assignee)
11:27 <dustin> jlund: btw, I increased the concurrency of the celery worker to 40 (from 20)
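For context, worker concurrency is a celery CLI knob; a hedged sketch of the kind of invocation involved (the app name here is an illustrative assumption, not the actual releng startup script):

    # start a worker with 40 prefork processes instead of 20
    celery -A relengapi worker --concurrency=40

More processes drain the queue faster, but they do nothing for tasks that never get started in the first place.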
Comment 4 • 9 years ago (Assignee)
12:07 <jlund> dustin: rabbitmq message count:
11:53 <vinh> jlund: 2,572 msg alert still hasn't cleared
12:08 <dustin> yeah, they're all in those response queues
12:09 <dustin> something's still broken
12:09 <dustin> when you trigger the celery jobs, there's some flag you set to say "I'm not waiting for the answer"
12:09 <dustin> are you setting that?
12:09 <dustin> why would releases_mozillabetabd50a25b0ff3.tar.gz_testing_mozharness have 634 *anything*?
12:10 <dustin> are there really that many requests for that single tarball?
12:10 <dustin> all within the time it takes to download one copy and upload to S3?
12:10 <dustin> something's really not right here..
12:11 <jlund> dustin: with retries and if tasks were wedged, maybe the count could be that high.
12:12 <jlund> dustin: I wonder if this logic is the culprit: if create_and_upload_archive.AsyncResult(task_id).state != 'STARTED':
12:13 <dustin> maybe
12:13 <dustin> I have no idea how celery gets that state
12:13 <dustin> I *assume* it's a dynamic request to the celery nodes
12:14 <dustin> since there's no other place to store state
12:14 <dustin> those queues are filled from the 'celeryresults' exchange
12:18 <dustin> jlund: ah, yeah, I bet STARTED is wrong, because I bet there's some pre-started state
12:18 <dustin> in which you don't want to call apply_async again
12:19 <jlund> dustin: that's what I'm thinking
12:19 <dustin> but also, apply_async -- I think -- waits for a response
12:19 <dustin> which is probably where those never-read messages are coming from
12:20 <dustin> hm, callek does the same for slaveloan
12:20 <dustin> 'course that's not in production so maybe it's not right
12:20 <jlund> so just quickly coming back to this code, I suppose if the state was "PENDING" that could be "a task doesn't exist by this id" or "a task exists, it just hasn't been picked up"
12:20 <dustin> I don't know
12:20 <dustin> I don't think Celery could distinguish those two cases
12:21 <jlund> if so, we are probably spinning up tasks with the same task id
12:21 <dustin> https://zapier.com/blog/async-celery-example-why-and-how/ suggests that calling AsyncResult(task_id) starts the task
12:21 <dustin> yeah
12:22 <dustin> I think that's one problem
12:22 <dustin> but I also think you're not collecting the results of any tasks
12:22 <jlund> well that's what my task_status(task_id) is doing
12:23 <dustin> http://celery.readthedocs.org/en/latest/userguide/tasks.html#ignore-results-you-don-t-want
12:23 <dustin> true
12:24 <dustin> so either that's not effective
12:24 <dustin> or 634 times for mozilla_release-beta, the job requested an archive, then stopped polling before the archive task was done
12:25 <jlund> sigh. I guess I have to roll this back
12:25 <dustin> Celery might have some timeout mechanism to clean up un-gathered responses -- this has to be a common issue
12:26 <dustin> it's possible that's a 1-day timeout or something, so we just didn't hit it here
12:26 <dustin> it might be worth pinging w.lach on monday to see if he has some more context
12:26 <dustin> I wish I knew more about this stuff, sorry
12:27 <dustin> I'll delete the biggest queues now
12:27 <jlund> dustin: no worries. I'll roll this all back and reconfig. I think using sql might be better for tracking state
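A minimal sketch of the ambiguity discussed above, using only the stock celery API (nothing here is the archiver code itself): celery reports PENDING both for a task that is queued but not yet picked up and for a task id it has never seen, so a state != 'STARTED' check cannot tell "in flight" from "does not exist":

    from celery.result import AsyncResult

    def looks_safe_to_submit(task_id):
        state = AsyncResult(task_id).state
        # PENDING may mean "queued, be patient" or "unknown id, go ahead";
        # both fall through this check, so pollers can fire duplicate
        # apply_async calls (and, with a result backend, leave unread
        # result messages piling up in rabbit).
        return state != 'STARTED'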
Comment 5 • 9 years ago (Assignee)
This is backed out now. Trees should be using mozharness.json again, and trees are re-opened.
Assignee: nobody → jlund
Severity: blocker → normal
Status: NEW → ASSIGNED
Comment 6 • 9 years ago (Assignee)
# from PR

This combats creating multiple celery tasks for the same job, and manages clean-up. It does this by:

- setting a time limit on how long a task can run (protecting us from hung tasks)
- setting an expiry on how long a task can sit un-actioned in the message queue
- tracking the state of celery tasks with a db backend. Prior to this I only tracked whether the state given by celery was 'STARTED', but celery tasks can also be PENDING or RETRY. You can't trust PENDING, as that's the default state for tasks that were never even created.
- explicitly capping the revision at 12 chars. If a buildbot job or user hits the archiver api with both the short hash and the long hash, archiver used to think this was two separate archives. This could be an issue if a bunch of buildbot jobs have the short hash stored in buildbot properties.
- some housekeeping with the trackers in the db. If a task goes past its time limit, or is in a finished state while its tracker still exists in the db, a badpenny job will clean it up.
- adding more logging to help track the health of archiver

Though this is much more work than simply checking 'if create_and_upload_archive.AsyncResult(task_id).state != "STARTED"', managing the tasks through a tracker should give us much more control. (A hedged sketch of this shape follows below.)
Attachment #8633804 - Flags: review?(dustin)
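A hedged sketch of the shape the PR describes (create_and_upload_archive and tables.ArchiverTask are names from the archiver blueprint; the limits, the session argument, and the helper itself are illustrative assumptions, not the landed code):

    from datetime import datetime

    PENDING_EXPIRES_IN = 60 * 60  # assumption: broker drops an unclaimed message after 1h
    TASK_TIME_LIMIT = 60 * 10     # assumption: worker kills a task running past 10m

    def submit_tracked_task(session, src_url, key):
        task_id = key.replace('/', '_')
        # expires protects the queue from un-actioned messages;
        # time_limit protects the workers from hung tasks
        task = create_and_upload_archive.apply_async(
            args=[src_url, key], task_id=task_id,
            expires=PENDING_EXPIRES_IN, time_limit=TASK_TIME_LIMIT)
        # track state in our own db row rather than trusting celery's PENDING
        session.add(tables.ArchiverTask(task_id=task.id, created_at=datetime.utcnow(),
                                        src_url=src_url, state="PENDING"))
        session.commit()
        return task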
Comment 7 • 9 years ago
Comment on attachment 8633804 [details] [review]
https://github.com/mozilla/build-relengapi/pull/296

Looks good, but some comments to address in the PR
Attachment #8633804 - Flags: review?(dustin) → feedback+
Comment 8 • 9 years ago (Assignee)
Comment on attachment 8633804 [details] [review]
https://github.com/mozilla/build-relengapi/pull/296

addresses PR comments
Attachment #8633804 - Flags: review?(dustin)
Updated • 9 years ago
Attachment #8633804 - Flags: review?(dustin) → review+
Comment 9 • 9 years ago (Assignee)
relengapi staging and prod have been updated to 3.1.2 with https://github.com/mozilla/build-relengapi/pull/296 in it
Comment 10 • 9 years ago (Assignee)
Once this is resolved, we should reverse the patch that added a fallback to hgmo in https://bugzil.la/1181331. The patch: https://bugzilla.mozilla.org/attachment.cgi?id=8635625
Comment 11 • 9 years ago (Assignee)
we have fixed up a lot of errors that caused perma-bad csets here: https://bugzilla.mozilla.org/show_bug.cgi?id=1185507#c6

however, I came across a recent job that was still timing out, so it seems there is still an issue.

e.g. treeherder: https://treeherder.mozilla.org/#/jobs?repo=try&revision=56272fbd3fb5
rev: 56272fbd3fb5

the first build to poll archiver was this one:
http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/nfitzgerald@mozilla.com-56272fbd3fb5/try-linux64-asan-debug/try-linux64-asan-debug-bm75-try1-build856.txt.gz

you can see it realizes the s3 url doesn't exist, so it creates a celery task. However, the task just stays pending and the build times out:

2015-07-24 10:59:02,183 Setting DEBUG logging.
2015-07-24 10:59:02,183 attempt 1/10
2015-07-24 10:59:02,183 Getting archive location from https://api.pub.build.mozilla.org/archiver/hgmo/try/56272fbd3fb5?&preferred_region=us-west-2&suffix=tar.gz&subdir=testing/mozharness
2015-07-24 10:59:03,528 attempt 1/10
2015-07-24 10:59:04,123 current task status: no status available at this point. state: PENDING
2015-07-24 10:59:04,123 sleeping for 10.00s (attempt 1/10)

all of the subsequent builds of this changeset behaved slightly differently (client logs):
http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/nfitzgerald@mozilla.com-56272fbd3fb5/try-android-api-11/try-android-api-11-bm76-try1-build1397.txt.gz

bash -c 'python /tools/checkouts/build-tools/buildfarm/utils/archiver_client.py mozharness --repo try --rev 56272fbd3fb5a400eb778bda1f3cc208d79becc0 --destination scripts --debug' in dir /builds/slave/try-l64-0000000000000000000000/. (timeout 1200 secs)
2015-07-24 11:03:16,608 truncating revision to first 12 chars
2015-07-24 11:03:16,608 Setting DEBUG logging.
2015-07-24 11:03:16,608 attempt 1/10
2015-07-24 11:03:16,608 Getting archive location from https://api.pub.build.mozilla.org/archiver/hgmo/try/56272fbd3fb5?&preferred_region=us-west-2&suffix=tar.gz&subdir=testing/mozharness
2015-07-24 11:03:17,530 sleeping for 11.00s (attempt 1/10)
2015-07-24 11:03:28,541 attempt 2/10
...etc
2015-07-24 11:10:58,476 Getting archive location from https://api.pub.build.mozilla.org/archiver/hgmo/try/56272fbd3fb5?&preferred_region=us-west-2&suffix=tar.gz&subdir=testing/mozharness
2015-07-24 11:10:59,360 sleeping for 99.00s (attempt 9/10)
2015-07-24 11:12:38,459 attempt 10/10
2015-07-24 11:12:38,459 Getting archive location from https://api.pub.build.mozilla.org/archiver/hgmo/try/56272fbd3fb5?&preferred_region=us-west-2&suffix=tar.gz&subdir=testing/mozharness
2015-07-24 11:12:39,366 Could not get a valid response from archiver.

Traceback (most recent call last):
  File "/tools/checkouts/build-tools/buildfarm/utils/archiver_client.py", line 153, in get_url_response
    response = urllib2.urlopen(api_url, timeout=60)
  File "/usr/lib64/python2.6/urllib2.py", line 126, in urlopen
    return _opener.open(url, data, timeout)
  File "/usr/lib64/python2.6/urllib2.py", line 397, in open
    response = meth(req, response)
  File "/usr/lib64/python2.6/urllib2.py", line 510, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib64/python2.6/urllib2.py", line 429, in error
    result = self._call_chain(*args)
  File "/usr/lib64/python2.6/urllib2.py", line 369, in _call_chain
    result = func(*args)
  File "/usr/lib64/python2.6/urllib2.py", line 616, in http_error_302
    return self.parent.open(new, timeout=req.timeout)
  File "/usr/lib64/python2.6/urllib2.py", line 397, in open
    response = meth(req, response)
  File "/usr/lib64/python2.6/urllib2.py", line 510, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib64/python2.6/urllib2.py", line 435, in error
    return self._call_chain(*args)
  File "/usr/lib64/python2.6/urllib2.py", line 369, in _call_chain
    result = func(*args)
  File "/usr/lib64/python2.6/urllib2.py", line 518, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
HTTPError: HTTP Error 404: Not Found

2015-07-24 11:12:39,368 attempt 1/3
2015-07-24 11:12:39,368 Trying src_url directly: https://hg.mozilla.org/try/archive/56272fbd3fb5.tar.gz/testing/mozharness
2015-07-24 11:12:41,816 unpacking tar archive at: try-56272fbd3fb5/testing/mozharness/
program finished with exit code 0

notice above that archiver tries to return an s3 url, but the url doesn't exist, so we raise a 404 every time the client polls 'https://api.pub.build.mozilla.org/archiver/hgmo/try/56272fbd3fb5?&preferred_region=us-west-2&suffix=tar.gz&subdir=testing/mozharness'

client context: here (http://mxr.mozilla.org/build/source/tools/buildfarm/utils/archiver_client.py#153); the raise is at line 164
relengapi context: https://github.com/mozilla/build-relengapi/blob/master/relengapi/blueprints/archiver/__init__.py#L186

meanwhile in relengapi logs:

/relengapi.log.18:2015-07-24 10:59:03,463 [relengapi.blueprints.archiver] Creating new celery task and task tracker for: try-56272fbd3fb5.tar.gz_testing_mozharness
./relengapi.log.18:2015-07-24 10:59:03,774 [relengapi.blueprints.archiver] generating GET URL to try-56272fbd3fb5.tar.gz/testing/mozharness, expires in 300s
./relengapi.log.18:2015-07-24 10:59:14,527 [relengapi.blueprints.archiver] checking status of task id try-56272fbd3fb5.tar.gz_testing_mozharness: current state PENDING
./relengapi.log.18:2015-07-24 10:59:15,267 [relengapi.blueprints.archiver] generating GET URL to try-56272fbd3fb5.tar.gz/testing/mozharness, expires in 300s
./relengapi.log.18:2015-07-24 10:59:15,668 [relengapi.blueprints.archiver] generating GET URL to try-56272fbd3fb5.tar.gz/testing/mozharness, expires in 300s
./relengapi.log.18:2015-07-24 10:59:17,118 [relengapi.blueprints.archiver] generating GET URL to try-56272fbd3fb5.tar.gz/testing/mozharness, expires in 300s
./relengapi.log.18:2015-07-24 10:59:31,145 [relengapi.blueprints.archiver] checking status of task id try-56272fbd3fb5.tar.gz_testing_mozharness: current state PENDING
./relengapi.log.18:2015-07-24 10:59:33,055 [relengapi.blueprints.archiver] generating GET URL to try-56272fbd3fb5.tar.gz/testing/mozharness, expires in 300s
....
# these get urls are repeated dozens of times
./relengapi.log.18:2015-07-24 11:00:35,588 [relengapi.blueprints.archiver] generating GET URL to try-56272fbd3fb5.tar.gz/testing/mozharness, expires in 300s
./relengapi.log.18:2015-07-24 11:00:35,893 [relengapi.blueprints.archiver] checking status of task id try-56272fbd3fb5.tar.gz_testing_mozharness: current state PENDING

notice there are no errors above. we log the intent to create the task and tracker, then at least one host polls the status url location (the first log url above) and it reports as pending. The other requests return generated signed urls for the s3 location, but the s3 url doesn't actually exist (those are the requests that received 404s).

meanwhile in celery logs, we can see that the celery task is never even created/started initially at 11:00 PT, as the first mention of 56272fbd3fb5 is at 12:18 PT:

[2015-07-24 12:18:04,675: INFO/MainProcess] Received task: relengapi.blueprints.badpenny.execution._run_job[b40ca441-1d8d-4f1e-9bdc-0094a7805d75]
[2015-07-24 12:18:04,705: INFO/Worker-11] Running task u'relengapi.blueprints.archiver.cleanup_old_tasks' id 20372
[2015-07-24 12:18:04,744: INFO/Worker-11] deleting tracker with id: try-56272fbd3fb5.tar.gz_testing_mozharness
[2015-07-24 12:18:04,773: INFO/MainProcess] Task relengapi.blueprints.badpenny.execution._run_job[b40ca441-1d8d-4f1e-9bdc-0094a7805d75] succeeded in 0.0952629260719s: None
[2015-07-24 12:18:56,658: INFO/MainProcess] Received task: relengapi.blueprints.archiver.tasks.create_and_upload_archive[try-56272fbd3fb5.tar.gz_testing_mozharness] expires:[2015-07-24 19:48:56.592159+00:00]
[2015-07-24 12:18:56,702: INFO/Worker-17] Key to be uploaded to S3: try-56272fbd3fb5.tar.gz/testing/mozharness - Verifying src_url: https://hg.mozilla.org/try/archive/56272fbd3fb5.tar.gz/testing/mozharness
[2015-07-24 12:18:58,865: INFO/Worker-17] Key to be uploaded to S3: try-56272fbd3fb5.tar.gz/testing/mozharness - downloading and unpacking archive from src_url
[2015-07-24 12:19:02,595: INFO/MainProcess] Task relengapi.blueprints.archiver.tasks.create_and_upload_archive[try-56272fbd3fb5.tar.gz_testing_mozharness] succeeded in 5.93358509988s: {'status': "Task completed! Check 's3_urls' for upload locations.", 's3_urls': {'us-west-2':...

the first time that 56272fbd3fb5 is logged is when we are cleaning up the tracker over an hour later, which proves that at least the tracker was added to the db and, seemingly, a task was created: https://github.com/mozilla/build-relengapi/blob/master/relengapi/blueprints/archiver/__init__.py#L177

however, it was only after badpenny rm'd the tracker that we created a new celery task and actually uploaded to s3 successfully. Back in the relengapi logs later on, after badpenny cleaned up the tracker, everything works fine again:

/relengapi.log.9:2015-07-24 12:17:15,184 [relengapi.blueprints.archiver] generating GET URL to try-56272fbd3fb5.tar.gz/testing/mozharness, expires in 300s
./relengapi.log.9:2015-07-24 12:18:56,591 [relengapi.blueprints.archiver] Creating new celery task and task tracker for: try-56272fbd3fb5.tar.gz_testing_mozharness
./relengapi.log.9:2015-07-24 12:18:56,863 [relengapi.blueprints.archiver] checking status of task id try-56272fbd3fb5.tar.gz_testing_mozharness: current state STARTED
./relengapi.log.9:2015-07-24 12:19:08,269 [relengapi.blueprints.archiver] checking status of task id try-56272fbd3fb5.tar.gz_testing_mozharness: current state SUCCESS
./relengapi.log.9:2015-07-24 12:19:08,270 [relengapi.blueprints.archiver] deleting tracker with id: try-56272fbd3fb5.tar.gz_testing_mozharness

my conclusions:

1) on the first few builds at 1100PT we try to create a tracker and task, but subsequent jobs are evaluating False for one of these two conditions:
- https://github.com/mozilla/build-relengapi/blob/master/relengapi/blueprints/archiver/__init__.py#L161
- https://github.com/mozilla/build-relengapi/blob/master/relengapi/blueprints/archiver/__init__.py#L173

2) none of the first few builds at 1100PT ever seem to get a celery task that starts; otherwise we would see log lines like the following around 1100PT on celery1:
- https://github.com/mozilla/build-relengapi/blob/master/relengapi/blueprints/archiver/tasks.py#L26

3) after an hour, we clean up the tracker (and celery/rabbit would remove any expired tasks), and somehow relengapi stops reporting that the s3 url exists, so we create another new task here (which completes fine this time)

4) one obvious bug is that we shouldn't be able to make it past https://github.com/mozilla/build-relengapi/blob/master/relengapi/blueprints/archiver/__init__.py#L173 without returning an error. However, if badpenny deletes a tracker that expired at 12:18, how are we skipping L173 in the first few builds?

</ braindump>
Comment 12 • 9 years ago (Assignee)
> 4) one obvious bug is that we shouldn't be able to make it past
> https://github.com/mozilla/build-relengapi/blob/master/relengapi/blueprints/
> archiver/__init__.py#L173 without returning an error. However, if badpenny
> deletes a tracker that expired at 12:18, how are we skipping L173 in the
> first few builds?
I read L173 backwards. We do have a tracker, so we should be returning a 202 with the location where the task status can be polled. Which means I should be doing:
diff --git a/relengapi/blueprints/archiver/__init__.py b/relengapi/blueprints/archiver/__init__.py
index c50a645..9f3933d 100644
--- a/relengapi/blueprints/archiver/__init__.py
+++ b/relengapi/blueprints/archiver/__init__.py
@@ -160,31 +160,31 @@ def get_archive(src_url, key, preferred_region):
     # first, see if the key exists
     if not s3.get_bucket(bucket).get_key(key):
         task_id = key.replace('/', '_')  # keep things simple and avoid slashes in task url
         # can't use unique support:
         # api.pub.build.mozilla.org/docs/development/databases/#unique-row-support-get-or-create
         # because we want to know when the row doesn't exist before creating it
         tracker = tables.ArchiverTask.query.filter(tables.ArchiverTask.task_id == task_id).first()
         if tracker and tracker.state in FINISHED_STATES:
             log.info('Task tracker: {} exists but finished with state: '
                      '{}'.format(task_id, tracker.state))
             # remove tracker and try celery task again
             delete_tracker(tracker)
             tracker = None
         if not tracker:
             log.info("Creating new celery task and task tracker for: {}".format(task_id))
             task = create_and_upload_archive.apply_async(args=[src_url, key], task_id=task_id)
             if task and task.id:
                 session.add(tables.ArchiverTask(task_id=task.id, created_at=now(),
                                                 src_url=src_url, state="PENDING"))
                 session.commit()
-                return {}, 202, {'Location': url_for('archiver.task_status', task_id=task.id)}
             else:
                 return {}, 500
+        return {}, 202, {'Location': url_for('archiver.task_status', task_id=task.id)}
     log.info("generating GET URL to {}, expires in {}s".format(key, GET_EXPIRES_IN))
     # return 302 pointing to s3 url with archive
     signed_url = s3.generate_url(
         method='GET', expires_in=GET_EXPIRES_IN,
         bucket=bucket, key=key
     )
     return redirect(signed_url)
Comment 13 • 9 years ago (Assignee)
so now the question is: why is the original archiver call here:

http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/nfitzgerald@mozilla.com-56272fbd3fb5/try-linux64-asan-debug/try-linux64-asan-debug-bm75-try1-build856.txt.gz

/relengapi.log.18:2015-07-24 10:59:03,463 [relengapi.blueprints.archiver] Creating new celery task and task tracker for: try-56272fbd3fb5.tar.gz_testing_mozharness

somehow able to create a tracker row if the celery task never actually starts: https://github.com/mozilla/build-relengapi/blob/master/relengapi/blueprints/archiver/__init__.py#L176

I bet the rabbit logs at 11:00 PT would give more insight for rev 56272fbd3fb5, but I don't have access to the rabbit web view.
Comment 14 • 9 years ago (Assignee)
One solution, if this just means creating celery tasks is not 100% reliable: simply say a celery task for mozharness should only take 2 min. If the task is not completed by then, create a new task. We are already cleaning up celery tasks that are hung (time_limit) or still in the rabbit queue (expires), so it shouldn't be that bad to create another one here and there.
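A hedged sketch of that idea (the tracker columns, the 60s window, and the resubmit hook are illustrative assumptions; comment 22 below shows the log lines of what actually landed):

    import logging
    from datetime import datetime, timedelta

    log = logging.getLogger(__name__)
    PENDING_EXPIRES = timedelta(seconds=60)  # assumption: how long PENDING is tolerated

    def renew_if_stuck_pending(session, tracker, resubmit_task):
        # if the task has sat in PENDING past its window, fire a replacement
        if tracker.state == 'PENDING' and datetime.utcnow() > tracker.pending_expires_at:
            log.info("Task %s has expired from pending too long. Re-creating task",
                     tracker.task_id)
            resubmit_task(tracker.task_id)  # e.g. apply_async again with the same task_id
            tracker.pending_expires_at = datetime.utcnow() + PENDING_EXPIRES
            session.commit()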
Comment 15 • 9 years ago
r+ to the diff in comment 12.

There's not a lot to love in the logs:

=INFO REPORT==== 24-Jul-2015::10:58:05 ===
accepting AMQP connection <0.20123.3706> (10.22.81.209:16866 -> 10.22.81.92:5672)

=INFO REPORT==== 24-Jul-2015::10:59:03 ===
accepting AMQP connection <0.20766.3706> (10.22.81.209:24111 -> 10.22.81.92:5672)

=WARNING REPORT==== 24-Jul-2015::10:59:19 ===
closing AMQP connection <0.17630.3706> (10.22.81.209:50454 -> 10.22.81.92:5672):
connection_closed_abruptly

=INFO REPORT==== 24-Jul-2015::10:59:32 ===
accepting AMQP connection <0.21068.3706> (10.22.81.209:29109 -> 10.22.81.92:5672)

=INFO REPORT==== 24-Jul-2015::10:59:32 ===
accepting AMQP connection <0.21074.3706> (10.22.81.209:29112 -> 10.22.81.92:5672)

=WARNING REPORT==== 24-Jul-2015::10:59:37 ===
closing AMQP connection <0.17817.3706> (10.22.81.209:52735 -> 10.22.81.92:5672):
connection_closed_abruptly

=INFO REPORT==== 24-Jul-2015::11:00:15 ===
accepting AMQP connection <0.21553.3706> (10.22.81.209:35374 -> 10.22.81.92:5672)

=WARNING REPORT==== 24-Jul-2015::11:00:58 ===
closing AMQP connection <0.18698.3706> (10.22.81.209:63655 -> 10.22.81.92:5672):
connection_closed_abruptly

=INFO REPORT==== 24-Jul-2015::11:01:09 ===
accepting AMQP connection <0.22129.3706> (10.22.81.209:42948 -> 10.22.81.92:5672)

=WARNING REPORT==== 24-Jul-2015::11:01:49 ===
closing AMQP connection <0.19212.3706> (10.22.81.209:5362 -> 10.22.81.92:5672):
connection_closed_abruptly

=INFO REPORT==== 24-Jul-2015::11:03:07 ===
accepting AMQP connection <0.23395.3706> (10.22.81.209:58479 -> 10.22.81.92:5672)

... and that connection closed quite a bit later ...

=WARNING REPORT==== 24-Jul-2015::11:48:55 ===
closing AMQP connection <0.20766.3706> (10.22.81.209:24111 -> 10.22.81.92:5672):
connection_closed_abruptly

With the change away from tracking results as queues, rabbit itself shows very little of interest in the web view. The `celery` queue is empty, so the message isn't stuck in there somewhere. However, looking in the DB:

mysql> select * from celery_taskmeta where task_id like '%56272fbd3fb5%' \G
*************************** 1. row ***************************
       id: 46
  task_id: try-56272fbd3fb5.tar.gz_testing_mozharness
   status: SUCCESS
   result: �}q(UstatusqU5Task completed! Check 's3_urls' for upload locations.qUs3_urlsq}q(U us-west-2qU�https://mozilla-releng-usw2-archiver.s3-us-west-2.amazonaws.com/try-56272fbd3fb5.tar.gz/testing/mozharness?Signature=59441kNImqdKgWXeJFRCZVir8oo%3D&Expires=1437765841&AWSAccessKeyId=AKIAILEHYUYSBPB5N3DQU us-east-1qU�https://mozilla-releng-use1-archiver.s3.amazonaws.com/try-56272fbd3fb5.tar.gz/testing/mozharness?Signature=LC%2FUiw8bqaCP%2Fa2nBx5Pbc15E8k%3D&Expires=1437765842&AWSAccessKeyId=AKIAILEHYUYSBPB5N3DQuUsrc_urlXI https://hg.mozilla.org/try/archive/56272fbd3fb5.tar.gz/testing/mozharnessq u.
date_done: 2015-07-24 19:19:03
traceback: NULL

that date_done is UTC, equivalent to 12:19:03 PT, so that represents the second run. The closest thing time-wise to the first run is:

mysql> select * from celery_taskmeta where date_done = '2015-07-24 18:06:04' \G
*************************** 1. row ***************************
       id: 19
  task_id: try-52d47fd24448.tar.gz_testing_mozharness
   status: SUCCESS
   result: �}q(UstatusqU5Task completed! Check 's3_urls' for upload locations.qUs3_urlsq}q(U us-west-2qU�https://mozilla-releng-usw2-archiver.s3-us-west-2.amazonaws.com/try-52d47fd24448.tar.gz/testing/mozharness?Signature=jb5Ilp5PH4sKSmLxPe1fkmzqBWE%3D&Expires=1437761462&AWSAccessKeyId=AKIAILEHYUYSBPB5N3DQU us-east-1qU�https://mozilla-releng-use1-archiver.s3.amazonaws.com/try-52d47fd24448.tar.gz/testing/mozharness?Signature=4kbcG%2ByP%2F2tflpTWlz2OtTsZsh8%3D&Expires=1437761463&AWSAccessKeyId=AKIAILEHYUYSBPB5N3DQuUsrc_urlXI https://hg.mozilla.org/try/archive/52d47fd24448.tar.gz/testing/mozharnessq u.
date_done: 2015-07-24 18:06:04
traceback: NULL
Comment 16 • 9 years ago
Comment 17 • 9 years ago (Assignee)
thanks for poking at this

(In reply to Dustin J. Mitchell [:dustin] from comment #15)
> r+ to the diff in comment 12.

sweet!

> =INFO REPORT==== 24-Jul-2015::11:03:07 ===
> accepting AMQP connection <0.23395.3706> (10.22.81.209:58479 -> 10.22.81.92:5672)
>
> ... and that connection closed quite a bit later ...
>
> =WARNING REPORT==== 24-Jul-2015::11:48:55 ===
> closing AMQP connection <0.20766.3706> (10.22.81.209:24111 -> 10.22.81.92:5672):
> connection_closed_abruptly

so it sounds like, the way things are configured, there is a reliability issue. Some examples I found of intermittent perma-PENDING:

- https://github.com/celery/celery/issues/1042 - not enough disk space on the rabbit hosts
- https://groups.google.com/forum/#!topic/celery-users/Xt9mjO7ZpQc - having to set strange celery configuration

> The `celery` queue is empty, so the message isn't stuck in there somewhere.

hrm, I wonder if it never made it, or maybe it was cleaned up after it expired.

> However, looking in the DB:

so if the db is only tracking task results, and the first celery task never started, there would never be a record of it in the db anyway, right?

Unless you stop me, I'm going to pursue a change where we re-create the task if we are still PENDING after X time (say 60s). If we have STARTED or we are in the middle of RETRY, keep the task going. This doesn't address the occasional unreliability between relengapi <-> archiver <-> celery <-> rabbit, but it would remove our reliance on falling back to hgmo when we get stuck like this.
Comment 18 • 9 years ago
camd,

We're seeing some instability with celery (specifically, tasks not starting) among other issues. We switched to a DB-based results backend instead of a rabbit/amqp-based one.

I'm n-i'ing you to see if you saw anything similar with treeherder, and/or for some insight into what you guys may be doing differently, so we can try to get our end fixed.

Any ideas welcome. (I'd rather fix the cause, not the symptom.)
Flags: needinfo?(cdawson)
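For reference, the backend switch mentioned above amounts to a celery 3.x configuration change of roughly this shape (a hedged sketch; the connection URL and expiry are illustrative, not relengapi's actual settings):

    # celeryconfig.py
    CELERY_RESULT_BACKEND = 'db+mysql://relengapi:secret@dbhost/relengapi'  # previously the amqp backend
    CELERY_TASK_RESULT_EXPIRES = 3600  # prune old result rows after an hour

With the database backend, results land in the celery_taskmeta table (as queried in comment 15) instead of in per-task rabbit queues.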
Comment 19 • 9 years ago (Assignee)
I know this is all overkill. Having a results tracker for celery tasks means something is likely broken, or that celery is not the right tool for my needs. But I am hoping the added complexity can at least unblock us from our reliance on falling back to hgmo when archiver celery tasks get into a perma-pending state.
Attachment #8639393 - Flags: review?(dustin)
Comment 20 • 9 years ago
Comment on attachment 8639393 [details] [review]
adds pending expiry time to archiver celery tracker

I'm not sure the logic is correct (maybe I'm just missing something)
Attachment #8639393 - Flags: review?(dustin)
Comment 21 • 9 years ago
Comment on attachment 8639393 [details] [review]
adds pending expiry time to archiver celery tracker

I get it now :)
Attachment #8639393 - Flags: review+
Comment 22 • 9 years ago (Assignee)
the band-aid seems to have helped. You can see here that we have a task stuck pending; after a minute we delete the tracker and trigger another celery task:

2015-07-28 15:05:33,111 [relengapi.blueprints.archiver] Creating new celery task and task tracker for: try-a25b10547463.tar.gz_testing_mozharness
2015-07-28 15:05:33,580 [relengapi.blueprints.archiver] checking status of task id try-a25b10547463.tar.gz_testing_mozharness: current state PENDING
2015-07-28 15:06:35,056 [relengapi.blueprints.archiver] checking status of task id try-a25b10547463.tar.gz_testing_mozharness: current state PENDING
2015-07-28 15:06:35,211 [relengapi.blueprints.archiver] Task try-a25b10547463.tar.gz_testing_mozharness has expired from pending too long. Re-creating task
2015-07-28 15:06:35,212 [relengapi.blueprints.archiver] renewing tracker 3435 with pending expiry: 2015-07-28 22:07:35.212706+00:00
2015-07-28 15:06:53,778 [relengapi.blueprints.archiver] checking status of task id try-a25b10547463.tar.gz_testing_mozharness: current state SUCCESS

from the db side: https://pastebin.mozilla.org/8840887

so I think we are good here. *however*, this happens way too often. Not sure on exact numbers yet, but 1 out of every dozen or few dozen celery tasks never actually leaves the PENDING state. It would be great to find out why...
Comment 23 • 9 years ago
(In reply to Justin Wood (:Callek) from comment #18)
> camd,
>
> We're seeing some instability with celery (specifically tasks not starting)
> among other issues. We switched to a DB-based results backend instead of a
> rabbit/amqp based.
>
> I'm n-i'ing you to see if you saw anything similar with treeherder, and/or
> for some insight into what you guys may be doing differently with that so we
> can try and get our end fixed.
>
> Any ideas welcome. (I'd rather fix the cause not the symptom)

callek-- We have had stability issues with Celery at times. Most of ours were due to restarting during deployments; it sometimes leaves zombie processes going. Early on, we started using the ``maxtasksperchild`` setting: http://celery.readthedocs.org/en/latest/userguide/workers.html#max-tasks-per-child-setting. This helped some of our issues and seemed to provide better longevity. There is a memory leak somewhere in our data ingestion, but we were never able to track it down.

I can't think of anything more than that, though I admit I don't grok everything in this bug. It's a biggie... :)

Looping in mdoglio in case he has any more info.
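For illustration, the setting camd describes takes either form in celery 3.x (the value and app name are placeholders, not treeherder's actual config):

    # on the worker command line:
    #   celery -A relengapi worker --maxtasksperchild=500
    # or in the config module:
    CELERYD_MAX_TASKS_PER_CHILD = 500  # recycle each worker process after 500 tasks

Recycling worker processes bounds the damage from slow leaks and zombie state, at the cost of periodic process startup.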
Updated • 9 years ago
Flags: needinfo?(cdawson) → needinfo?(mdoglio)
Comment 24 • 9 years ago (Assignee)
we are ready to try removing the need to fall back on hgmo, thanks to https://bugzilla.mozilla.org/show_bug.cgi?id=1185507#c6 and https://bugzilla.mozilla.org/show_bug.cgi?id=1182532#c22

rail, this reverses the patch I had you review here: https://bug1181331.bmoattachments.org/attachment.cgi?id=8635625

it also lets any exceptions raised on retry attempts be logged after each attempt, so we know earlier in the log if something is wrong.
Attachment #8640576 - Flags: review?(rail)
Updated • 9 years ago
Attachment #8640576 - Flags: review?(rail) → review+
Comment 25 • 9 years ago (Assignee)
Comment on attachment 8640576 [details] [diff] [review]
150729_1182532_archiver_intermittent_pending-rms_dep_on_hgmo_fallback-tools.patch

ta

https://hg.mozilla.org/build/tools/rev/301ea12efbad
Attachment #8640576 - Flags: checked-in+
Comment 26 • 9 years ago
I have never experienced a situation where a task is not picked up by a worker. Do you have a task monitor showing the list of tasks stuck in pending state?
Flags: needinfo?(mdoglio)
Comment 27 • 9 years ago
I have something similar in bug 1190405. Signing workers don't use celery, but they use kombu, which is used by celery. Explicitly setting the heartbeat interval helped there. You can try setting http://docs.celeryproject.org/en/latest/reference/celery.bin.worker.html#cmdoption-celery-worker--heartbeat-interval
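For illustration, that option is a worker command-line flag (the interval and app name are placeholders):

    celery -A relengapi worker --heartbeat-interval=30  # send AMQP heartbeats every 30s

Regular heartbeats let both ends notice a half-dead connection instead of waiting for TCP timeouts, which matters given the connection_closed_abruptly warnings in comment 15.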
Comment 28 • 9 years ago
The heartbeat setting is being addressed in bug 1190408. Mauro, thanks for the info.

What do you use for monitoring? I've been using the celery command (wrapped to set up the environment):

[root@celery1.srv.releng.scl3 relengapi]# ./run-celery.sh inspect active
-> celery@celery1.srv.releng.scl3.mozilla.com: OK
    - empty -

but it has never really given me useful data.
Comment 29 • 9 years ago
sorry, wrong bug.
Comment 30 • 9 years ago
(:dgarvey is referring to a comment he deleted)
Updated • 8 years ago (Assignee)
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Updated • 6 years ago
Component: General Automation → General