Closed Bug 1394557 (opened 7 years ago, closed 6 years ago)

Intermittent "400 Bad Request" errors when uploading to the TC queue causing Windows job failures

Categories: Taskcluster :: General (defect)
Priority: Not set
Severity: normal
Status: RESOLVED FIXED
Reporter: RyanVM
Assignee: pmoore
Attachments: 1 file

This just burned a Windows Beta job and will be delaying go-to-build as a result, so setting the severity to Critical here. Especially because it means we have to do the |taskcluster task rerun| dance (which few people have the ability to do) and risk hitting bug 1381768 on the retrigger.

These failures show up in Treeherder as infra exceptions (purple) and the TH log parser isn't able to see anything useful to highlight, so I fully expect this bug to get little in the way of starring activity, but a cursory glance at TH suggests it's happening on a daily basis at least.

Examples:
https://queue.taskcluster.net/v1/task/S7va0KV_QW2YHzcDy7Ejtw/runs/0/artifacts/public/logs/live_backing.log
https://queue.taskcluster.net/v1/task/PPUAp2JZSuKILy-wiK_fsQ/runs/0/artifacts/public/logs/live_backing.log

Is there something we can do to be more tolerant of these failures if we can't make them go away outright?
It looks like these are timeouts talking to S3.

Pete, can you have a look?  I wonder if we're just doing too many parallel uploads and thus slowing down too much?
Flags: needinfo?(pmoore)
So I think we can improve the situation here. What sucks is that S3 is returning an HTTP 400 response when there is a delay sending data:

> Error uploading artifact: (Permanent) HTTP response code 400

And, as https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/400 says:

> The HTTP 400 Bad Request response status code indicates that the server could not understand the
> request due to invalid syntax. The client should not repeat this request without modification.

So since we get a 400 response, we intentionally don't retry. Really, I think AWS shouldn't return HTTP 400 here, since the failure might not be a client issue but e.g. network congestion.

I'd propose we special case this particular failure, and make it retry.

Note, we (rather inefficiently) upload artifacts in series rather than in parallel, so there should only be one upload running at a time.
Flags: needinfo?(pmoore)
(In reply to Pete Moore [:pmoore][:pete] from comment #2)
> What sucks is that S3 is returning an HTTP 400 response when there is a delay sending data:

For context, the failure message is:

  "Your socket connection to the server was not read from or written to within the timeout period. Idle connections will be closed."
Assignee: nobody → pmoore
This should do it.
Attachment #8902648 - Flags: review?(dustin)
Release generic-worker 10.2.1 in progress, should appear here:

* https://github.com/taskcluster/generic-worker/releases/tag/v10.2.1

We'll still need to roll it out in https://github.com/mozilla-releng/OpenCloudConfig
Comment on attachment 8902648 [details] [review]
Github Pull Request for generic-worker

carrying over the approved review from github
Attachment #8902648 - Flags: review?(dustin) → review+
Looks like the builders were updated to 10.2.2 recently, which should include this fix:
https://github.com/mozilla-releng/OpenCloudConfig/commit/25a2cc91a604f132aea2f30f35ab18a83df4fc8f
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Word of warning: test workers (win7/win10) can also be hit by this in test tasks - that should disappear when bug 1399401 lands.
Reopening as this still affects testers, until bug 1399401 lands...
Status: RESOLVED → REOPENED
Depends on: 1399401
Resolution: FIXED → ---
Severity: critical → normal
Deployed to all gecko windows workers in bug 1399401.
Status: REOPENED → RESOLVED
Closed: 7 years ago → 6 years ago
Resolution: --- → FIXED