Closed Bug 909011 Opened 11 years ago Closed 9 years ago

[traceback] handle occasional amqplib connection errors

Categories

(Input Graveyard :: Code Quality, defect, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: willkg, Assigned: willkg)

Details

(Whiteboard: u=user c=general p=1 s=input.2015q1)

When someone leaves feedback, we save the feedback in the database and then kick off a celery task to index that response.

Very infrequently, amqplib raises a connection error. I'm pretty sure that means the response isn't indexed in ES, and it might also mean that the user gets back an HTTP 500 error.

This bug covers writing a test to check whether those two theories are true, and mitigating the issue in some way.
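
One way to probe both theories without a real broker is to make the task's delay() call fail and watch what save() does. The sketch below is not fjord's test suite: FakeOpinion and index_task are hypothetical stand-ins for the model and the celery task, and the handler list imitates Django's post_save signal, whose send() does not catch exceptions raised by receivers.

```python
import socket
from unittest import mock


class FakeOpinion:
    """Hypothetical stand-in for the feedback model."""
    saved = []               # pretend database table
    post_save_handlers = []  # pretend post_save receivers

    def save(self):
        # The row is written first...
        FakeOpinion.saved.append(self)
        # ...then the handlers run; anything they raise reaches the caller.
        for handler in FakeOpinion.post_save_handlers:
            handler(self)


# Hypothetical stand-in for the celery indexing task.
index_task = mock.Mock()
FakeOpinion.post_save_handlers.append(lambda obj: index_task.delay(obj))

# Simulate the broker connection dropping while the task is being queued.
index_task.delay.side_effect = socket.error(32, "Broken pipe")

opinion = FakeOpinion()
try:
    opinion.save()
    raised = False
except socket.error:
    raised = True
```

In this model the row is saved, the indexing task is never queued, and the exception escapes save(). In a view with no error handling, that is what would surface as an HTTP 500.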


Traceback from production:

Traceback (most recent call last):

  File "/data/www/input.mozilla.org/input/vendor/lib/python/django/core/handlers/base.py", line 111, in get_response
    response = callback(request, *callback_args, **callback_kwargs)

  File "/usr/lib64/python2.6/site-packages/newrelic-1.11.0.55/newrelic/api/object_wrapper.py", line 216, in __call__
    self._nr_instance, args, kwargs)

  File "/usr/lib64/python2.6/site-packages/newrelic-1.11.0.55/newrelic/hooks/framework_django.py", line 475, in wrapper
    return wrapped(*args, **kwargs)

  File "/data/www/input.mozilla.org/input/vendor/lib/python/django/views/decorators/csrf.py", line 77, in wrapped_view
    return view_func(*args, **kwargs)

  File "/data/www/input.mozilla.org/input/vendor/lib/python/django/views/decorators/cache.py", line 89, in _wrapped_view_func
    response = view_func(request, *args, **kwargs)

  File "/data/www/input.mozilla.org/input/fjord/feedback/views.py", line 265, in feedback_router
    return view(request, *args, **kwargs)

  File "/data/www/input.mozilla.org/input/vendor/lib/python/django/views/decorators/csrf.py", line 77, in wrapped_view
    return view_func(*args, **kwargs)

  File "/data/www/input.mozilla.org/input/vendor/lib/python/django/views/decorators/http.py", line 41, in inner
    return func(request, *args, **kwargs)

  File "/data/www/input.mozilla.org/input/fjord/feedback/views.py", line 213, in android_about_feedback
    response, form = _handle_feedback_post(request)

  File "/data/www/input.mozilla.org/input/fjord/feedback/views.py", line 110, in _handle_feedback_post
    opinion.save()

  File "/data/www/input.mozilla.org/input/vendor/lib/python/django/db/models/base.py", line 463, in save
    self.save_base(using=using, force_insert=force_insert, force_update=force_update)

  File "/data/www/input.mozilla.org/input/vendor/lib/python/django/db/models/base.py", line 565, in save_base
    created=(not record_exists), raw=raw, using=using)

  File "/data/www/input.mozilla.org/input/vendor/lib/python/django/dispatch/dispatcher.py", line 172, in send
    response = receiver(signal=self, sender=sender, **named)

  File "/data/www/input.mozilla.org/input/fjord/search/tasks.py", line 122, in _live_index_handler
    index_item_task.delay(instance.get_mapping_type(), instance.id)

  File "/data/www/input.mozilla.org/input/vendor/lib/python/celery/app/task/__init__.py", line 353, in delay
    return self.apply_async(args, kwargs)

  File "/data/www/input.mozilla.org/input/vendor/lib/python/celery/app/task/__init__.py", line 449, in apply_async
    publish = publisher or self.app.amqp.publisher_pool.acquire(block=True)

  File "/data/www/input.mozilla.org/input/vendor/lib/python/kombu/connection.py", line 657, in acquire
    R = self.prepare(R)

  File "/data/www/input.mozilla.org/input/vendor/lib/python/kombu/pools.py", line 57, in prepare
    p.revive(connection.default_channel)

  File "/data/www/input.mozilla.org/input/vendor/lib/python/kombu/connection.py", line 583, in default_channel
    self._default_channel = self.channel()

  File "/data/www/input.mozilla.org/input/vendor/lib/python/kombu/connection.py", line 151, in channel
    chan = self.transport.create_channel(self.connection)

  File "/data/www/input.mozilla.org/input/vendor/lib/python/kombu/transport/amqplib.py", line 259, in create_channel
    return connection.channel()

  File "/data/www/input.mozilla.org/input/vendor/lib/python/kombu/transport/amqplib.py", line 183, in channel
    return Channel(self, channel_id)

  File "/data/www/input.mozilla.org/input/vendor/lib/python/kombu/transport/amqplib.py", line 207, in __init__
    super(Channel, self).__init__(*args, **kwargs)

  File "/data/www/input.mozilla.org/input/vendor/lib/python/amqplib/client_0_8/channel.py", line 82, in __init__
    self._x_open()

  File "/data/www/input.mozilla.org/input/vendor/lib/python/amqplib/client_0_8/channel.py", line 469, in _x_open
    self._send_method((20, 10), args)

  File "/data/www/input.mozilla.org/input/vendor/lib/python/amqplib/client_0_8/abstract_channel.py", line 76, in _send_method
    method_sig, args, content)

  File "/data/www/input.mozilla.org/input/vendor/lib/python/amqplib/client_0_8/method_framing.py", line 252, in write_method
    self.dest.write_frame(1, channel, payload)

  File "/data/www/input.mozilla.org/input/vendor/lib/python/amqplib/client_0_8/transport.py", line 165, in write_frame
    frame_type, channel, size, payload, 0xce))

  File "<string>", line 1, in sendall

error: [Errno 32] Broken pipe

These could be causing some responses not to get added to ES.

Fixing the whiteboard data.
Priority: -- → P3
Whiteboard: u=user c=general p= se=input.2013q4 → u=user c=general p= s=input.2013q4
Pushing this to 2014q1 because I can't get to it this quarter.
Whiteboard: u=user c=general p= s=input.2013q4 → u=user c=general p= s=input.2014q1
Moving this to 2014q2.
Whiteboard: u=user c=general p= s=input.2014q1 → u=user c=general p= s=input.2014q2
When we add other post_save -> celery task jobs, this will affect those, too. Ricky mentioned maybe tossing those tasks in a db-based queue when amqplib fails.
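
A rough sketch of the db-based fallback idea, under stated assumptions: it uses an in-memory sqlite3 table in place of the real database, and the names pending_tasks, enqueue_with_fallback and drain are made up for illustration. The enqueue callable stands in for whatever actually publishes the task to the broker.

```python
import json
import socket
import sqlite3


def make_fallback_store():
    """Create the (hypothetical) table that holds tasks we could not queue."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE pending_tasks "
        "(id INTEGER PRIMARY KEY, name TEXT, args TEXT)"
    )
    return db


def enqueue_with_fallback(db, enqueue, name, args):
    """Try the broker first; on a connection error, park the task in the db."""
    try:
        enqueue(name, args)
        return "queued"
    except (socket.error, IOError):
        db.execute(
            "INSERT INTO pending_tasks (name, args) VALUES (?, ?)",
            (name, json.dumps(args)),
        )
        db.commit()
        return "deferred"


def drain(db, enqueue):
    """Re-send parked tasks in order; stops at the first failure."""
    rows = db.execute(
        "SELECT id, name, args FROM pending_tasks ORDER BY id"
    ).fetchall()
    sent = 0
    for row_id, name, args in rows:
        enqueue(name, json.loads(args))  # raises if the broker is still down
        db.execute("DELETE FROM pending_tasks WHERE id = ?", (row_id,))
        db.commit()
        sent += 1
    return sent


def broker_down(name, args):
    raise socket.error("broker unreachable")


delivered = []


def broker_up(name, args):
    delivered.append((name, args))


db = make_fallback_store()
status = enqueue_with_fallback(db, broker_down, "index_item", [7])
resent = drain(db, broker_up)
```

Something would still need to call drain() periodically, for example a cron job, and that part is left out here.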
I haven't seen one of these in a while, so bumping to 2014q3. If I still haven't seen one by then, I'll nix it.
Whiteboard: u=user c=general p= s=input.2014q2 → u=user c=general p= s=input.2014q3
Whiteboard: u=user c=general p= s=input.2014q3 → u=user c=general p= s=input.2014q4
These still happen periodically, but not often enough to warrant spending time on this. Pushing it to the backlog.
Whiteboard: u=user c=general p= s=input.2014q4 → u=user c=general p= s=
They're doing a PtoV for rabbitmq the first week of March. I'm going to grab this bug and make sure that Input is resilient to rabbitmq being unavailable.

The outcome of this work should be that users leaving feedback *never* see an HTTP 500 if rabbitmq is down or the amqp connection fails in some way. To make sure it isn't failing silently, I'll have it send me emails.
Assignee: nobody → willkg
Status: NEW → ASSIGNED
Priority: P3 → P1
Whiteboard: u=user c=general p= s= → u=user c=general p= s=input.2015q1
I wrapped the indexing task creation in code that checks whether the problem is amqp-related. If it is, the code sends an error email to the admin and otherwise does nothing, so the user doesn't get an HTTP 500. This should adequately handle most or all rabbitmq outages, including the upcoming PtoV work.
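
A minimal sketch of that kind of wrapper, not the code that landed in the commit. It assumes the amqp failure surfaces as a socket-level error, as in the "Broken pipe" traceback above. queue_or_notify is a hypothetical name, and the notify_admin callable stands in for whatever sends the email (in a Django project that could be a thin wrapper around django.core.mail.mail_admins).

```python
import errno
import socket


def queue_or_notify(enqueue, notify_admin, *args):
    """Queue a task; on a connection error, tell the admin instead of raising.

    Returns True if the task was queued, False if the broker was unreachable.
    """
    try:
        enqueue(*args)
        return True
    except (socket.error, IOError) as exc:
        # Swallow the error so the request can finish normally, but make
        # sure the failure is reported rather than silently lost.
        notify_admin("Could not queue task %r: %s" % (args, exc))
        return False


def broken_enqueue(*args):
    # Simulates the failure from the production traceback.
    raise socket.error(errno.EPIPE, "Broken pipe")


emails = []
queued = queue_or_notify(broken_enqueue, emails.append, "response", 42)
```

Only connection-type errors are caught here. Other exceptions still propagate, so unrelated bugs in the task call are not hidden.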

In a PR: https://github.com/mozilla/fjord/pull/497#issuecomment-75563467

Landed in https://github.com/mozilla/fjord/commit/1f3bde06298007396117b27545d29aefb5f9fd30

Pushed it out just now. I'll keep an eye on it, but so far it looks good.
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Whiteboard: u=user c=general p= s=input.2015q1 → u=user c=general p=1 s=input.2015q1
Product: Input → Input Graveyard