Closed Bug 1177190 Opened 9 years ago Closed 9 years ago

git+http doesn't appear to honor keep alive settings with Centos 6

Categories

(Release Engineering :: General, defect)


Tracking

(firefox42 fixed)

RESOLVED FIXED

People

(Reporter: hwine, Assigned: wcosta)


Attachments

(2 files)

tl;dr: timeouts fetching from git.mozilla.org may be due to lack of HTTP keep alive support in libcurl

Over the last few days, there have been numerous reports of timeouts in TC jobs interacting with git.mozilla.org.

Due to excellent detective work by a combined TC, MOC, Dev Services & Releng Crew, the following events were noticed:
 - TC builder client had a "hung" git fetch for /external/caf/platform/external/libpng.git
 - TC builder client had a TCP socket in CLOSE_WAIT state
 - git1.dmz.scl3 did not have an associated connection
 - git1.dmz.scl3 did have some "client disconnect" messages, but unclear if related
 - zlb VIP did not have an associated connection
 - zlb does not log connection terminations
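
For anyone wanting to repeat the client-side check, here is a minimal sketch of finding sockets stuck in CLOSE_WAIT by reading the text of /proc/net/tcp on a Linux builder. It is not the tool used during the investigation. The function names are hypothetical, it handles IPv4 only, and the address decoding assumes a little-endian host.

```python
import socket
import struct
from typing import List, Tuple

# Linux prints the TCP state in /proc/net/tcp as a hex code; 08 is CLOSE_WAIT.
CLOSE_WAIT = "08"


def decode_addr(hex_addr: str) -> str:
    """Turn '0100007F:1F90' into '127.0.0.1:8080' (IPv4, little-endian host)."""
    ip_hex, port_hex = hex_addr.split(":")
    ip = socket.inet_ntoa(struct.pack("<I", int(ip_hex, 16)))
    return f"{ip}:{int(port_hex, 16)}"


def close_wait_sockets(proc_net_tcp: str) -> List[Tuple[str, str]]:
    """Return (local, remote) address pairs for every socket in CLOSE_WAIT."""
    found = []
    for line in proc_net_tcp.splitlines()[1:]:  # skip the header row
        fields = line.split()
        # fields: slot, local_address, rem_address, state, ...
        if len(fields) >= 4 and fields[3] == CLOSE_WAIT:
            found.append((decode_addr(fields[1]), decode_addr(fields[2])))
    return found
```

Passing it the contents of /proc/net/tcp lists each half-closed connection, which can then be compared against what the server and load balancer report.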

The socket in CLOSE_WAIT state triggered a check of the keepalive configuration. Neither client nor server overrides the default setting of 5 seconds for git protocol connections.

However, while researching if there was a configuration setting for keepalive on git+HTTP, the following article http://git.661346.n2.nabble.com/PATCH-http-enable-keepalive-on-TCP-sockets-td7597589.html suggested that git+HTTP keepalives were only supported with libcurl version 7.25 and later.

Investigation of the TC builder client showed it is using CentOS 6, which has version 7.19.7 of libcurl.
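
The version comparison above can be automated. The sketch below assumes the 7.25.0 threshold from the linked article (the release in which libcurl gained CURLOPT_TCP_KEEPALIVE) and takes whatever version string the builder reports, for example the first line of `curl --version`. The function names are hypothetical.

```python
import re
from typing import Tuple

# Assumed threshold: CURLOPT_TCP_KEEPALIVE first appeared in libcurl 7.25.0.
MIN_KEEPALIVE_VERSION = (7, 25, 0)


def parse_version(text: str) -> Tuple[int, ...]:
    """Extract the first 'X.Y.Z' triple found in a version string."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.groups())


def supports_tcp_keepalive(version_text: str) -> bool:
    """True if the reported libcurl is new enough to set TCP keepalive."""
    # Tuples compare element by element, so (7, 9, 0) < (7, 25, 0) as intended.
    return parse_version(version_text) >= MIN_KEEPALIVE_VERSION
```

Any version string below 7.25.0, such as the one found on the CentOS 6 builders, makes `supports_tcp_keepalive` return False.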

There are several options from here - that's what this bug is to coordinate.
Awesome detective work guys, and nice summary!
Do we understand why this only started to affect us so severely within the last week?
Moving bug - also occurring in Buildbot jobs, which also use CentOS 6 builders.
Component: TaskCluster → General Automation
Product: Testing → Release Engineering
QA Contact: catlee
Attached patch timeout.patch
WORKAROUND: change timeout to allow quicker fails while rest of problem investigated.

"10 min" picked as 33% higher than average time.
Attachment #8626311 - Flags: review?(catlee)
Comment on attachment 8626311 [details] [diff] [review]
timeout.patch

r+ from :catlee IRL (yay WW)
Attachment #8626311 - Flags: review?(catlee) → review+
Next step is to get someone from b2g build team to debug the "repo tool" output and/or add debugging output to it.

Until we know what specific command is failing, and how, we're stuck.

ni: mwu for help and/or a reference
Flags: needinfo?(mwu)
This is arguably a tree-closing issue (or at least grounds for hiding most B2G emulator/device image builds). I'm not sure who's around right now, but a ~50+% failure rate isn't acceptable and needs attention from *someone* ASAP.
Flags: needinfo?(sdeckelmann)
Flags: needinfo?(jonas)
Flags: needinfo?(jocheng)
Flags: needinfo?(faramarz)
Flags: needinfo?(fabrice)
No references here. We don't usually mess with git and repo. Sounds like something changed on the automation side and needs to be backed out. Alternately, you can experiment with upgrading git and/or libcurl.
Flags: needinfo?(mwu)
:wcosta is working right now on upgrading libcurl to stop burning builds. The longer-term fix here is probably getting off CentOS6.
Flags: needinfo?(sdeckelmann)
Flags: needinfo?(fabrice)
Assignee: nobody → wcosta
Status: NEW → ASSIGNED
Bug 1177190: Update libcurl in docker images. r=selena

libcurl shipped with CentOS 6 doesn't support keepalive. This is causing
builds to burn.
Attachment #8627515 - Flags: review?(sdeckelmann)
Comment on attachment 8627515 [details]
MozReview Request: Bug 1177190: Update libcurl in docker images. r=selenamarie

Bug 1177190: Update libcurl in docker images. r=selena

libcurl shipped with CentOS 6 doesn't support keepalive. This is causing
builds to burn.
Attachment #8627515 - Flags: review?(sdeckelmann) → review+
Comment on attachment 8627515 [details]
MozReview Request: Bug 1177190: Update libcurl in docker images. r=selenamarie

https://reviewboard.mozilla.org/r/12255/#review10733

Ship It!
Flags: needinfo?(wcosta)
Depends on: 1178899
Emulator bustage was caused by Bug 1178899. It should be fixed now; I was able to run a successful build:
https://tools.taskcluster.net/task-inspector/#a8ecyIG3QEuGN95D7BOXUg/2

Can we backout the backout?
Flags: needinfo?(wcosta) → needinfo?(cbook)
Just talked to selena; we are going to do that another way.
Flags: needinfo?(cbook)
Depends on: 1178997
Comment on attachment 8627515 [details]
MozReview Request: Bug 1177190: Update libcurl in docker images. r=selenamarie

Bug 1177190: Update libcurl in docker images. r=selenamarie

libcurl on CentOS 6 doesn't support keepalive, so we upgrade it.
The approach we take to avoid breaking buildbot machines is to
grab libcurl from CentOS 7, build it on CentOS 6 and upload rpms
to S3.
Attachment #8627515 - Attachment description: MozReview Request: Bug 1177190: Update libcurl in docker images. r=selena → MozReview Request: Bug 1177190: Update libcurl in docker images. r=selenamarie
Attachment #8627515 - Flags: review+ → review?(sdeckelmann)
Attachment #8627515 - Flags: review?(sdeckelmann) → review?(dustin)
Comment on attachment 8627515 [details]
MozReview Request: Bug 1177190: Update libcurl in docker images. r=selenamarie

Bug 1177190: Update libcurl in docker images. r=selenamarie

libcurl on CentOS 6 doesn't support keepalive, so we upgrade it.
The approach we take to avoid breaking buildbot machines is to
grab libcurl from CentOS 7, build it on CentOS 6 and upload rpms
to S3.
https://reviewboard.mozilla.org/r/12253/#review10881

::: testing/docker/b2g-build/Dockerfile:18
(Diff revision 2)
> +  cd -

You should be able to just 'yum install $url' which avoids loading the RPMs onto disk
Comment on attachment 8627515 [details]
MozReview Request: Bug 1177190: Update libcurl in docker images. r=selenamarie

It'd be good to have some comments in there regarding why these aren't installed from a yum repo, too.

FWIW, there's another option to enforce keepalive for everything:
  http://www.tldp.org/HOWTO/html_single/TCP-Keepalive-HOWTO/#libkeepalive
Why the Linux kernel doesn't do this by default, I don't know.  Would the Internet collapse from an extra TCP round trip every 5 minutes?  The number of TCP connections that last that long is a vanishingly small portion of all TCP connections.  But I digress.
Attachment #8627515 - Flags: review?(dustin) → review+
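
For context on the keepalive digression above: an application that owns its socket can opt in per connection, which is what the libkeepalive preload approach does on behalf of programs that don't. A minimal sketch follows. The function name and the default values are assumptions; the 300 second idle time simply mirrors the "every 5 minutes" figure mentioned above.

```python
import socket


def enable_keepalive(sock: socket.socket, idle_s: int = 300,
                     interval_s: int = 60, probes: int = 5) -> socket.socket:
    """Turn on TCP keepalive probes for one socket."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The tuning knobs are platform-specific, so set only those that exist here.
    tuning = (
        ("TCP_KEEPIDLE", idle_s),       # seconds of idle before the first probe
        ("TCP_KEEPINTVL", interval_s),  # seconds between unanswered probes
        ("TCP_KEEPCNT", probes),        # unanswered probes before giving up
    )
    for name, value in tuning:
        if hasattr(socket, name):
            sock.setsockopt(socket.IPPROTO_TCP, getattr(socket, name), value)
    return sock
```

With probes enabled, a peer that silently disappears is eventually detected and the connection errors out instead of hanging indefinitely.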
https://hg.mozilla.org/mozilla-central/rev/4bfe1c223646
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Flags: needinfo?(jocheng)
Flags: needinfo?(faramarz)
Component: General Automation → General