Closed Bug 1425438 Opened 6 years ago Closed 6 years ago

Idle time inaccurate when (and after) computer sleeps/hibernates/is not being watched

Categories

(Taskcluster :: Workers, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: pmoore, Unassigned)

References

Details

This is a weird one - see the following log:


2017/12/14 22:04:07 No task claimed. Idle for 2h48m58.685070362s.
2017/12/14 22:05:08 No task claimed. Idle for 2h49m59.59994974s.
2017/12/14 22:06:09 No task claimed. Idle for 2h51m1.175527692s.
2017/12/14 22:07:27 Error: (Intermittent) HTTP response code 503
HTTP/1.1 503 Service Unavailable
Content-Length: 506
Cache-Control: no-cache, no-store
Connection: keep-alive
Content-Type: text/html; charset=utf-8
Date: Thu, 14 Dec 2017 21:07:26 GMT
Server: Cowboy

<!DOCTYPE html>
	<html>
	  <head>
		<meta name="viewport" content="width=device-width, initial-scale=1">
		<meta charset="utf-8">
		<title>Application Error</title>
		<style media="screen">
		  html,body,iframe {
			margin: 0;
			padding: 0;
		  }
		  html,body {
			height: 100%;
			overflow: hidden;
		  }
		  iframe {
			width: 100%;
			height: 100%;
			border: 0;
		  }
		</style>
	  </head>
	  <body>
		<iframe src="//www.herokucdn.com/error-pages/application-error.html"></iframe>
	  </body>
	</html>
2017/12/14 22:07:48 No task claimed. Idle for 2h52m39.646402239s.
2017/12/14 23:34:33 Error: Post https://queue.taskcluster.net/v1/claim-work/pmoore-manual/mac-os-x: read tcp 192.168.2.102:55564->54.225.234.26:443: read: operation timed out
2017/12/15 00:28:06 Error: Post https://queue.taskcluster.net/v1/claim-work/pmoore-manual/mac-os-x: read tcp 192.168.2.102:55618->50.16.212.167:443: read: operation timed out
2017/12/15 01:38:44 Error: Post https://queue.taskcluster.net/v1/claim-work/pmoore-manual/mac-os-x: read tcp 192.168.2.102:55677->54.221.201.38:443: read: operation timed out
2017/12/15 01:39:09 No task claimed. Idle for 3h7m32.254364563s.
2017/12/15 02:27:58 Error: Post https://queue.taskcluster.net/v1/claim-work/pmoore-manual/mac-os-x: read tcp 192.168.2.102:55770->50.16.212.167:443: read: operation timed out
2017/12/15 02:33:56 Error: Post https://queue.taskcluster.net/v1/claim-work/pmoore-manual/mac-os-x: read tcp 192.168.2.102:55852->54.221.201.38:443: read: connection reset by peer
2017/12/15 02:34:17 No task claimed. Idle for 3h13m36.563701508s.
2017/12/15 03:20:12 Error: Post https://queue.taskcluster.net/v1/claim-work/pmoore-manual/mac-os-x: read tcp 192.168.2.102:55865->54.225.234.26:443: read: operation timed out
2017/12/15 04:11:12 Error: Post https://queue.taskcluster.net/v1/claim-work/pmoore-manual/mac-os-x: read tcp 192.168.2.102:55914->50.16.212.167:443: read: operation timed out
2017/12/15 04:11:34 No task claimed. Idle for 3h23m8.189155353s.
2017/12/15 04:59:48 Error: Post https://queue.taskcluster.net/v1/claim-work/pmoore-manual/mac-os-x: read tcp 192.168.2.102:56005->54.225.234.26:443: read: operation timed out
2017/12/15 05:00:09 No task claimed. Idle for 3h28m3.931194413s.
2017/12/15 05:56:25 Error: Post https://queue.taskcluster.net/v1/claim-work/pmoore-manual/mac-os-x: read tcp 192.168.2.102:56091->54.225.234.26:443: read: operation timed out
2017/12/15 06:47:29 Error: Post https://queue.taskcluster.net/v1/claim-work/pmoore-manual/mac-os-x: read tcp 192.168.2.102:56188->50.16.212.167:443: read: operation timed out
2017/12/15 06:59:24 Error: Post https://queue.taskcluster.net/v1/claim-work/pmoore-manual/mac-os-x: dial tcp 54.221.201.38:443: connect: network is down
2017/12/15 06:59:49 No task claimed. Idle for 3h37m43.412501247s.
2017/12/15 08:01:41 Error: Post https://queue.taskcluster.net/v1/claim-work/pmoore-manual/mac-os-x: read tcp 192.168.2.102:56244->54.225.234.26:443: read: operation timed out
2017/12/15 08:02:06 No task claimed. Idle for 3h43m19.58027416s.
2017/12/15 08:03:28 No task claimed. Idle for 3h44m28.14050157s.
2017/12/15 08:04:30 No task claimed. Idle for 3h45m29.91415128s.
2017/12/15 08:05:31 No task claimed. Idle for 3h46m30.986977671s.
2017/12/15 08:06:33 No task claimed. Idle for 3h47m32.733614223s.



No task was claimed during this period, so the idle times are inaccurate. Reverse engineering the last active time, we get:


2017/12/14 22:04:07 No task claimed. Idle for 2h48m58.685070362s. => idle since 19:15:08
2017/12/14 22:05:08 No task claimed. Idle for 2h49m59.59994974s.  => idle since 19:15:08
2017/12/14 22:06:09 No task claimed. Idle for 2h51m1.175527692s.  => idle since 19:15:08
2017/12/14 22:07:48 No task claimed. Idle for 2h52m39.646402239s. => idle since 19:15:08
2017/12/15 01:39:09 No task claimed. Idle for 3h7m32.254364563s.  => idle since 22:31:37
2017/12/15 02:34:17 No task claimed. Idle for 3h13m36.563701508s. => idle since 23:20:40
2017/12/15 04:11:34 No task claimed. Idle for 3h23m8.189155353s.  => idle since 00:48:26
2017/12/15 05:00:09 No task claimed. Idle for 3h28m3.931194413s.  => idle since 01:32:05
2017/12/15 06:59:49 No task claimed. Idle for 3h37m43.412501247s. => idle since 03:22:06
2017/12/15 08:02:06 No task claimed. Idle for 3h43m19.58027416s.  => idle since 04:18:46
2017/12/15 08:03:28 No task claimed. Idle for 3h44m28.14050157s.  => idle since 04:19:00
2017/12/15 08:04:30 No task claimed. Idle for 3h45m29.91415128s.  => idle since 04:19:00
2017/12/15 08:05:31 No task claimed. Idle for 3h46m30.986977671s. => idle since 04:19:00
2017/12/15 08:06:33 No task claimed. Idle for 3h47m32.733614223s. => idle since 04:19:00


During the period where the computer was possibly in some kind of hibernation mode, the idle times seem to get reduced. When the computer is fully awake around 08:00, the idle times don't recover, but remain relative to the previous calculation.

This is strange, because idle times are calculated *absolutely* rather than being cumulatively added. There is an absolute timestamp stored of when the worker was last active[1], and at each logging entry, the duration between the current time and the last active time is calculated[2]. The only time the last active timestamp is updated, is when a new task is claimed[1].

This is a mystery to me at the moment. Even if there were some very slow threads during the hibernation time, I would expect that things would recover when the machine went out of hibernation mode. The fact the timestamps are accurate after the machine wakes, makes me think it isn't a bug in go's interpretation of the current time.

One troubleshooting option I have is to output the last active timestamp in the log line too, and to log every time it is set or updated. This might help diagnose the problem.

[1] https://github.com/taskcluster/generic-worker/blob/95e83aa1f0c7d8aeb6e905fcfb9b8492e468b816/main.go#L608
[2] https://github.com/taskcluster/generic-worker/blob/95e83aa1f0c7d8aeb6e905fcfb9b8492e468b816/main.go#L615
So, having left this running, I now have the data I was after. Here is an example:

2017/12/18 03:14:49 No task claimed. Idle for 6h9m25.196602544s (since 2017-12-15 21:54:16.806064 +0100 CET m=+0.015021164).

Clearly the duration between "2017-12-15 21:54:16.806064 +0100 CET m=+0.015021164" and "2017/12/18 03:14:49" is more than 6h9m25.196602544s.

This logging output comes from link in comment 1 above.
Commit pushed to master at https://github.com/taskcluster/generic-worker

https://github.com/taskcluster/generic-worker/commit/ecc115fd90d9984320428797e6b4223283b41225
Bug 1425438 - even more logging to see why idle time calculation might be incorrect
OK this is looking more and more like a go bug:


2017/12/19 13:43:49 No task claimed at 2017-12-19 13:43:49.983474 +0100 CET m=+2339.061368752. Idle for 23m40.900620113s (since 2017-12-19 12:14:48.854239 +0100 CET m=+918.160748639).


which comes essentially from:

> now := time.Now()
> idleTime := now.Sub(lastActive)
> log.Printf("No task claimed at %v. Idle for %v (since %v).", now, idleTime, lastActive)

=>

   lastActive = 2017-12-19 12:14:48.854239 +0100 CET m=+918.160748639
          now = 2017-12-19 13:43:49.983474 +0100 CET m=+2339.061368752
     idleTime = 23m40.900620113s

But idleTime should be the duration between the two timestamps, i.e. around 1h29m1s.

I'll see if I can create a small program to reproduce issue, and submit a golang bug report.
Commits pushed to master at https://github.com/taskcluster/generic-worker

https://github.com/taskcluster/generic-worker/commit/569e52e9fcf31463dac4e69ec669450e47b7f122
Revert "Bug 1425438 - even more logging to see why idle time calculation might be incorrect"

This reverts commit ecc115fd90d9984320428797e6b4223283b41225.

https://github.com/taskcluster/generic-worker/commit/7dfbec08660832e7cbb5eab4c3cace702456b2ca
Revert "Bug 1425438 - extra debugging info"

This reverts commit 897af5679d299859bfdeb6c8126be3d2e87c4654.

https://github.com/taskcluster/generic-worker/commit/e7aada1a7843b53c93a6568af33e9c9c386f3f6e
Bug 1425438 - use wall time durations rather than monotonic clock durations in time arithmetic
See Also: → 1424986
Fixed in generic-worker 10.4.1.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Component: Generic-Worker → Workers
You need to log in before you can comment on or make changes to this bug.