Closed Bug 909011 Opened 11 years ago Closed 10 years ago

[traceback] handle occasional amqplib connection errors

Categories

(Input Graveyard :: Code Quality, defect, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: willkg, Assigned: willkg)

Details

(Whiteboard: u=user c=general p=1 s=input.2015q1)

When someone leaves feedback, we save the feedback in the database and then kick off a celery task to index that response. Very infrequently, there's a connection error in amqplib which I'm pretty sure means the response isn't indexed in ES and also might mean that the user gets back an HTTP 500 error.
This bug covers writing a test for that to see whether those two theories are true and alleviating the issue in some way.
Traceback from production:
Traceback (most recent call last):
  File "/data/www/input.mozilla.org/input/vendor/lib/python/django/core/handlers/base.py", line 111, in get_response
    response = callback(request, *callback_args, **callback_kwargs)
  File "/usr/lib64/python2.6/site-packages/newrelic-1.11.0.55/newrelic/api/object_wrapper.py", line 216, in __call__
    self._nr_instance, args, kwargs)
  File "/usr/lib64/python2.6/site-packages/newrelic-1.11.0.55/newrelic/hooks/framework_django.py", line 475, in wrapper
    return wrapped(*args, **kwargs)
  File "/data/www/input.mozilla.org/input/vendor/lib/python/django/views/decorators/csrf.py", line 77, in wrapped_view
    return view_func(*args, **kwargs)
  File "/data/www/input.mozilla.org/input/vendor/lib/python/django/views/decorators/cache.py", line 89, in _wrapped_view_func
    response = view_func(request, *args, **kwargs)
  File "/data/www/input.mozilla.org/input/fjord/feedback/views.py", line 265, in feedback_router
    return view(request, *args, **kwargs)
  File "/data/www/input.mozilla.org/input/vendor/lib/python/django/views/decorators/csrf.py", line 77, in wrapped_view
    return view_func(*args, **kwargs)
  File "/data/www/input.mozilla.org/input/vendor/lib/python/django/views/decorators/http.py", line 41, in inner
    return func(request, *args, **kwargs)
  File "/data/www/input.mozilla.org/input/fjord/feedback/views.py", line 213, in android_about_feedback
    response, form = _handle_feedback_post(request)
  File "/data/www/input.mozilla.org/input/fjord/feedback/views.py", line 110, in _handle_feedback_post
    opinion.save()
  File "/data/www/input.mozilla.org/input/vendor/lib/python/django/db/models/base.py", line 463, in save
    self.save_base(using=using, force_insert=force_insert, force_update=force_update)
  File "/data/www/input.mozilla.org/input/vendor/lib/python/django/db/models/base.py", line 565, in save_base
    created=(not record_exists), raw=raw, using=using)
  File "/data/www/input.mozilla.org/input/vendor/lib/python/django/dispatch/dispatcher.py", line 172, in send
    response = receiver(signal=self, sender=sender, **named)
  File "/data/www/input.mozilla.org/input/fjord/search/tasks.py", line 122, in _live_index_handler
    index_item_task.delay(instance.get_mapping_type(), instance.id)
  File "/data/www/input.mozilla.org/input/vendor/lib/python/celery/app/task/__init__.py", line 353, in delay
    return self.apply_async(args, kwargs)
  File "/data/www/input.mozilla.org/input/vendor/lib/python/celery/app/task/__init__.py", line 449, in apply_async
    publish = publisher or self.app.amqp.publisher_pool.acquire(block=True)
  File "/data/www/input.mozilla.org/input/vendor/lib/python/kombu/connection.py", line 657, in acquire
    R = self.prepare(R)
  File "/data/www/input.mozilla.org/input/vendor/lib/python/kombu/pools.py", line 57, in prepare
    p.revive(connection.default_channel)
  File "/data/www/input.mozilla.org/input/vendor/lib/python/kombu/connection.py", line 583, in default_channel
    self._default_channel = self.channel()
  File "/data/www/input.mozilla.org/input/vendor/lib/python/kombu/connection.py", line 151, in channel
    chan = self.transport.create_channel(self.connection)
  File "/data/www/input.mozilla.org/input/vendor/lib/python/kombu/transport/amqplib.py", line 259, in create_channel
    return connection.channel()
  File "/data/www/input.mozilla.org/input/vendor/lib/python/kombu/transport/amqplib.py", line 183, in channel
    return Channel(self, channel_id)
  File "/data/www/input.mozilla.org/input/vendor/lib/python/kombu/transport/amqplib.py", line 207, in __init__
    super(Channel, self).__init__(*args, **kwargs)
  File "/data/www/input.mozilla.org/input/vendor/lib/python/amqplib/client_0_8/channel.py", line 82, in __init__
    self._x_open()
  File "/data/www/input.mozilla.org/input/vendor/lib/python/amqplib/client_0_8/channel.py", line 469, in _x_open
    self._send_method((20, 10), args)
  File "/data/www/input.mozilla.org/input/vendor/lib/python/amqplib/client_0_8/abstract_channel.py", line 76, in _send_method
    method_sig, args, content)
  File "/data/www/input.mozilla.org/input/vendor/lib/python/amqplib/client_0_8/method_framing.py", line 252, in write_method
    self.dest.write_frame(1, channel, payload)
  File "/data/www/input.mozilla.org/input/vendor/lib/python/amqplib/client_0_8/transport.py", line 165, in write_frame
    frame_type, channel, size, payload, 0xce))
  File "<string>", line 1, in sendall
error: [Errno 32] Broken pipe
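The traceback shows the error being raised inside a post_save receiver, and Django's Signal.send() does not catch exceptions raised by receivers (send_robust() does). That suggests both theories can hold at once: the row is written, the indexing task is never queued, and the exception reaches the view. Below is a minimal stand-in for checking that reasoning. It does not use Django; the Signal, Response, and submit_feedback names are hypothetical, and transaction handling (which could roll the row back) is ignored.

```python
import errno
import socket


class Signal(object):
    """Tiny stand-in for a signal dispatcher whose send() does not catch errors."""

    def __init__(self):
        self.receivers = []

    def connect(self, receiver):
        self.receivers.append(receiver)

    def send(self, sender, **named):
        # No try/except here: the first failing receiver raises into the caller.
        return [(r, r(sender=sender, **named)) for r in self.receivers]


post_save = Signal()
saved_rows = []      # stands in for the database table
indexed_rows = []    # stands in for the search index


class Response(object):
    def __init__(self, text):
        self.text = text

    def save(self):
        saved_rows.append(self.text)  # the "database write" happens first
        post_save.send(sender=Response, instance=self)


def live_index_handler(sender, instance, **kwargs):
    # Simulates the broker socket failing while the task is being queued,
    # so the indexing step below is never reached.
    raise socket.error(errno.EPIPE, "Broken pipe")
    indexed_rows.append(instance.text)


post_save.connect(live_index_handler)


def submit_feedback(text):
    """Return an HTTP-like status code for a feedback post."""
    try:
        Response(text).save()
    except Exception:
        return 500
    return 200
```

Calling submit_feedback("hi") in this sketch returns 500 while saved_rows contains the text and indexed_rows stays empty, which matches the suspected behavior.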
These could be causing some responses not to get added to ES. Fixing the whiteboard data.
Priority: -- → P3
Whiteboard: u=user c=general p= se=input.2013q4 → u=user c=general p= s=input.2013q4
Pushing this to 2014q1 because I can't get to it this quarter.
Whiteboard: u=user c=general p= s=input.2013q4 → u=user c=general p= s=input.2014q1
Moving this to 2014q2.
Whiteboard: u=user c=general p= s=input.2014q1 → u=user c=general p= s=input.2014q2
When we add other post_save -> celery task jobs, this will affect those, too. Ricky mentioned maybe tossing those tasks in a db-based queue when amqplib fails.
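A rough sketch of the db-based fallback idea, assuming a publish callable that raises a socket-level error when the broker is unreachable. It uses sqlite3 only so the example is self-contained; the deferred_tasks table and the queue_task and drain_deferred helpers are hypothetical names, not anything in fjord.

```python
import json
import socket
import sqlite3


def make_fallback_store(conn):
    """Create the table that holds tasks the broker could not accept."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS deferred_tasks "
        "(id INTEGER PRIMARY KEY, name TEXT, args TEXT)"
    )


def queue_task(publish, conn, name, args):
    """Try the broker first; on a connection error, park the task in the db.

    Returns True if the task was published, False if it was deferred.
    """
    try:
        publish(name, args)
        return True
    except (socket.error, IOError):
        conn.execute(
            "INSERT INTO deferred_tasks (name, args) VALUES (?, ?)",
            (name, json.dumps(args)),
        )
        conn.commit()
        return False


def drain_deferred(publish, conn):
    """Re-publish parked tasks in insertion order; stop at the first failure.

    Returns the number of tasks that were published and removed.
    """
    rows = conn.execute(
        "SELECT id, name, args FROM deferred_tasks ORDER BY id"
    ).fetchall()
    sent = 0
    for row_id, name, args in rows:
        try:
            publish(name, json.loads(args))
        except (socket.error, IOError):
            break
        conn.execute("DELETE FROM deferred_tasks WHERE id = ?", (row_id,))
        conn.commit()
        sent += 1
    return sent
```

Something like drain_deferred would need to run periodically (a cron job, for example) once the broker is back. That scheduling piece is left out here.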
I haven't seen one of these in a while, so I'm bumping this to 2014q3. If I still haven't seen one by then, I'll nix it.
Whiteboard: u=user c=general p= s=input.2014q2 → u=user c=general p= s=input.2014q3
Whiteboard: u=user c=general p= s=input.2014q3 → u=user c=general p= s=input.2014q4
These still happen periodically, but not often enough to warrant spending time on this. Pushing it to the backlog.
Whiteboard: u=user c=general p= s=input.2014q4 → u=user c=general p= s=
They're doing a PtoV for rabbitmq the first week of March. I'm going to grab this bug and make sure that Input is resilient to rabbitmq being unavailable. Outcome of this work should be that users leaving feedback *never* see an HTTP 500 if rabbitmq is down or the amqp connection fails in some way. So that it doesn't fail silently without anyone knowing, I'll have it send me emails.
Assignee: nobody → willkg
Status: NEW → ASSIGNED
Priority: P3 → P1
Whiteboard: u=user c=general p= s= → u=user c=general p= s=input.2015q1
I wrapped the indexing task creation in some code that checks whether the problem is amqp-related. If so, it sends an error email to the admin but otherwise does nothing, so the user doesn't get an HTTP 500. This should adequately handle most/all rabbitmq outages, including the upcoming PtoV work.
In a PR: https://github.com/mozilla/fjord/pull/497#issuecomment-75563467
Landed in https://github.com/mozilla/fjord/commit/1f3bde06298007396117b27545d29aefb5f9fd30
Pushed it out just now. I'll keep an eye on it, but so far it looks good.
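For readers who don't want to open the commit, here is a minimal sketch of that kind of wrapper. It is not the code that landed. The is_amqp_related heuristic and the create_task_safely and notify_admins names are assumptions for illustration. In a Django project, notify_admins could be django.core.mail.mail_admins, which takes a subject and a message.

```python
import socket
import traceback


def is_amqp_related(exc):
    """Heuristic (an assumption): socket-level errors, or exceptions whose
    class is defined in the amqplib or kombu packages.

    Note: on Python 3, socket.error is an alias of OSError, so this also
    matches unrelated OS-level errors; a real check may need to be narrower.
    """
    module = type(exc).__module__ or ""
    return (
        isinstance(exc, socket.error)
        or module.split(".")[0] in ("amqplib", "kombu")
    )


def create_task_safely(create_task, notify_admins):
    """Run create_task(); swallow and report broker errors, re-raise the rest.

    Returns True if the task was created, False if a broker error was
    swallowed and reported via notify_admins(subject, message).
    """
    try:
        create_task()
        return True
    except Exception as exc:
        if not is_amqp_related(exc):
            raise  # unrelated bugs should still surface
        notify_admins(
            "Indexing task not queued: %r" % (exc,),
            traceback.format_exc(),
        )
        return False
```

Under these assumptions, the post_save receiver would call something like create_task_safely(lambda: index_item_task.delay(instance.get_mapping_type(), instance.id), mail_admins). Note that the response is still not indexed when the broker is down. The wrapper only keeps the failure from reaching the user and makes sure someone hears about it.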
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Whiteboard: u=user c=general p= s=input.2015q1 → u=user c=general p=1 s=input.2015q1
Product: Input → Input Graveyard