Closed
Bug 1041763
Opened 10 years ago
Closed 10 years ago
upgrade ec2 linux64 test masters from m3.medium to m3.large
Categories
(Infrastructure & Operations Graveyard :: CIDuty, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: jlund, Assigned: jlund)
References
Details
Puppet reports show a fork error for bm117. arr suspects the likely cause is that it ran out of memory.
Assignee | ||
Comment 1•10 years ago
This is an ec2 instance set up as a linux test master. State of the machine (/proc/cpuinfo and /proc/meminfo): https://www.irccloud.com/pastebin/QrZXpf1r
Comment 2•10 years ago
Jul 21 13:21:22 buildbot-master117 collectd[1306]: ethstat plugin: No stats available for eth0
Jul 21 13:25:11 buildbot-master117 puppet-agent[30562]: Enabling Puppet.
Jul 21 13:25:51 buildbot-master117 kernel: puppet invoked oom-killer: gfp_mask=0x200da, order=0, oom_adj=0, oom_score_adj=0
Jul 21 13:25:51 buildbot-master117 kernel: puppet cpuset=/ mems_allowed=0
Jul 21 13:25:51 buildbot-master117 kernel: Pid: 30840, comm: puppet Not tainted 2.6.32-220.el6.x86_64 #1
Jul 21 13:25:51 buildbot-master117 kernel: Call Trace:
Jul 21 13:25:51 buildbot-master117 kernel: [<ffffffff810c2cb1>] ? cpuset_print_task_mems_allowed+0x91/0xb0
Jul 21 13:25:51 buildbot-master117 kernel: [<ffffffff81113a30>] ? dump_header+0x90/0x1b0
Jul 21 13:25:51 buildbot-master117 kernel: [<ffffffff81113eba>] ? oom_kill_process+0x8a/0x2c0
Jul 21 13:25:51 buildbot-master117 kernel: [<ffffffff81113df1>] ? select_bad_process+0xe1/0x120
Jul 21 13:25:51 buildbot-master117 kernel: [<ffffffff81114310>] ? out_of_memory+0x220/0x3c0
Jul 21 13:25:51 buildbot-master117 kernel: [<ffffffff8112402e>] ? __alloc_pages_nodemask+0x89e/0x940
Jul 21 13:25:51 buildbot-master117 kernel: [<ffffffff81158c7a>] ? alloc_pages_vma+0x9a/0x150
Jul 21 13:25:51 buildbot-master117 kernel: [<ffffffff8113aa9d>] ? do_wp_page+0xfd/0x8d0
Jul 21 13:25:51 buildbot-master117 kernel: [<ffffffff8113b6b9>] ? __do_fault+0x449/0x510
Jul 21 13:25:51 buildbot-master117 kernel: [<ffffffff81004a49>] ? __raw_callee_save_xen_pmd_val+0x11/0x1e
Jul 21 13:25:51 buildbot-master117 kernel: [<ffffffff8113ba4d>] ? handle_pte_fault+0x2cd/0xb50
Jul 21 13:25:51 buildbot-master117 kernel: [<ffffffff814efa5a>] ? error_exit+0x2a/0x60
Jul 21 13:25:51 buildbot-master117 kernel: [<ffffffff8100ba9d>] ? retint_restore_args+0x5/0x6
Jul 21 13:25:51 buildbot-master117 kernel: [<ffffffff8100122a>] ? hypercall_page+0x22a/0x1010
Jul 21 13:25:51 buildbot-master117 kernel: [<ffffffff8100122a>] ? hypercall_page+0x22a/0x1010
Jul 21 13:25:51 buildbot-master117 kernel: [<ffffffff81004a49>] ? __raw_callee_save_xen_pmd_val+0x11/0x1e
Jul 21 13:25:51 buildbot-master117 kernel: [<ffffffff8113c4b4>] ? handle_mm_fault+0x1e4/0x2b0
Jul 21 13:25:51 buildbot-master117 kernel: [<ffffffff81042b39>] ? __do_page_fault+0x139/0x480
Jul 21 13:25:51 buildbot-master117 kernel: [<ffffffff8100988e>] ? __switch_to+0x26e/0x320
Jul 21 13:25:51 buildbot-master117 kernel: [<ffffffff81055d1f>] ? finish_task_switch+0x4f/0xe0
Jul 21 13:25:51 buildbot-master117 kernel: [<ffffffff814eca20>] ? thread_return+0x4e/0x77e
Jul 21 13:25:51 buildbot-master117 kernel: [<ffffffff810d447c>] ? audit_syscall_entry+0xc/0x2a0
Jul 21 13:25:51 buildbot-master117 kernel: [<ffffffff814f246e>] ? do_page_fault+0x3e/0xa0
Jul 21 13:25:51 buildbot-master117 kernel: [<ffffffff814ef825>] ? page_fault+0x25/0x30
Jul 21 13:25:51 buildbot-master117 kernel: Mem-Info:
Jul 21 13:25:51 buildbot-master117 kernel: Node 0 DMA per-cpu:
Jul 21 13:25:51 buildbot-master117 kernel: CPU 0: hi: 0, btch: 1 usd: 0
Jul 21 13:25:51 buildbot-master117 kernel: Node 0 DMA32 per-cpu:
Jul 21 13:25:51 buildbot-master117 kernel: CPU 0: hi: 186, btch: 31 usd: 30
Jul 21 13:25:51 buildbot-master117 kernel: active_anon:777957 inactive_anon:153842 isolated_anon:32
Jul 21 13:25:51 buildbot-master117 kernel: active_file:2 inactive_file:58 isolated_file:0
Jul 21 13:25:51 buildbot-master117 kernel: unevictable:0 dirty:18 writeback:0 unstable:0
Jul 21 13:25:51 buildbot-master117 kernel: free:4011 slab_reclaimable:2209 slab_unreclaimable:4790
Jul 21 13:25:51 buildbot-master117 kernel: mapped:1 shmem:74 pagetables:5117 bounce:0
Jul 21 13:25:51 buildbot-master117 kernel: Node 0 DMA free:8256kB min:16kB low:20kB high:24kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:8252kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
Jul 21 13:25:51 buildbot-master117 kernel: lowmem_reserve[]: 0 3771 3771 3771
Jul 21 13:25:51 buildbot-master117 kernel: Node 0 DMA32 free:7788kB min:7848kB low:9808kB high:11772kB active_anon:3111828kB inactive_anon:615368kB active_file:8kB inactive_file:232kB unevictable:0kB isolated(anon):128kB isolated(file):0kB present:3862240kB mlocked:0kB dirty:72kB writeback:0kB mapped:4kB shmem:296kB slab_reclaimable:8836kB slab_unreclaimable:19160kB kernel_stack:792kB pagetables:20468kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:160 all_unreclaimable? no
Jul 21 13:25:51 buildbot-master117 kernel: lowmem_reserve[]: 0 0 0 0
Jul 21 13:25:51 buildbot-master117 kernel: Node 0 DMA: 2*4kB 1*8kB 1*16kB 1*32kB 0*64kB 2*128kB 1*256kB 1*512kB 1*1024kB 1*2048kB 1*4096kB = 8256kB
Jul 21 13:25:51 buildbot-master117 kernel: Node 0 DMA32: 423*4kB 0*8kB 1*16kB 2*32kB 0*64kB 1*128kB 1*256kB 1*512kB 1*1024kB 0*2048kB 1*4096kB = 7788kB
Jul 21 13:25:51 buildbot-master117 kernel: 3495 total pagecache pages
Jul 21 13:25:51 buildbot-master117 kernel: 3361 pages in swap cache
Jul 21 13:25:51 buildbot-master117 kernel: Swap cache stats: add 8680784, delete 8677423, find 7779912/8400558
Jul 21 13:25:51 buildbot-master117 kernel: Free swap = 0kB
Jul 21 13:25:51 buildbot-master117 kernel: Total swap = 4194296kB
Jul 21 13:25:51 buildbot-master117 kernel: 983039 pages RAM
Jul 21 13:25:51 buildbot-master117 kernel: 20652 pages reserved
Jul 21 13:25:51 buildbot-master117 kernel: 10150 pages shared
Jul 21 13:25:51 buildbot-master117 kernel: 945466 pages non-shared
Jul 21 13:25:51 buildbot-master117 kernel: [ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name
Jul 21 13:25:51 buildbot-master117 kernel: [ 304] 0 304 2712 1 0 -17 -1000 udevd
Jul 21 13:25:51 buildbot-master117 kernel: [ 526] 0 526 2711 1 0 -17 -1000 udevd
Jul 21 13:25:51 buildbot-master117 kernel: [ 772] 0 772 2293 39 0 0 0 dhclient
Jul 21 13:25:51 buildbot-master117 kernel: [ 812] 0 812 1553 2 0 0 0 portreserve
Jul 21 13:25:51 buildbot-master117 kernel: [ 819] 0 819 62187 91 0 0 0 rsyslogd
Jul 21 13:25:51 buildbot-master117 kernel: [ 844] 32 844 4756 16 0 0 0 rpcbind
Jul 21 13:25:51 buildbot-master117 kernel: [ 861] 81 861 5359 41 0 0 0 dbus-daemon
Jul 21 13:25:51 buildbot-master117 kernel: [ 885] 68 885 6218 66 0 0 0 hald
Jul 21 13:25:51 buildbot-master117 kernel: [ 886] 0 886 4539 2 0 0 0 hald-runner
Jul 21 13:25:51 buildbot-master117 kernel: [ 1022] 500 1022 2314 1 0 0 0 run_pulse_publi
Jul 21 13:25:51 buildbot-master117 kernel: [ 1024] 500 1024 30060 3373 0 0 0 python
Jul 21 13:25:51 buildbot-master117 kernel: [ 1150] 38 1150 7552 34 0 0 0 ntpd
Jul 21 13:25:51 buildbot-master117 kernel: [ 1257] 0 1257 2304 18 0 0 0 abrt-dump-oops
Jul 21 13:25:51 buildbot-master117 kernel: [ 1266] 0 1266 29709 2 0 0 0 abrtd
Jul 21 13:25:51 buildbot-master117 kernel: [ 1274] 0 1274 29312 23 0 0 0 crond
Jul 21 13:25:51 buildbot-master117 kernel: [ 1304] 0 1304 1543 1 0 0 0 collectdmon
Jul 21 13:25:51 buildbot-master117 kernel: [ 1306] 0 1306 168278 159 0 0 0 collectd
Jul 21 13:25:51 buildbot-master117 kernel: [ 1333] 0 1333 1029 2 0 0 0 mingetty
Jul 21 13:25:51 buildbot-master117 kernel: [ 1335] 0 1335 1029 2 0 0 0 mingetty
Jul 21 13:25:51 buildbot-master117 kernel: [ 1337] 0 1337 1032 2 0 0 0 agetty
Jul 21 13:25:51 buildbot-master117 kernel: [ 1339] 0 1339 1029 2 0 0 0 mingetty
Jul 21 13:25:51 buildbot-master117 kernel: [ 1341] 0 1341 1029 2 0 0 0 mingetty
Jul 21 13:25:51 buildbot-master117 kernel: [ 1343] 0 1343 1029 2 0 0 0 mingetty
Jul 21 13:25:51 buildbot-master117 kernel: [ 1345] 0 1345 1029 2 0 0 0 mingetty
Jul 21 13:25:51 buildbot-master117 kernel: [ 2134] 0 2134 16017 36 0 0 0 sshd
Jul 21 13:25:51 buildbot-master117 kernel: [22024] 500 22024 2126971 883209 0 0 0 buildbot
Jul 21 13:25:51 buildbot-master117 kernel: [ 532] 500 532 26538 1 0 0 0 run_command_run
Jul 21 13:25:51 buildbot-master117 kernel: [ 533] 500 533 49062 537 0 0 0 python
Jul 21 13:25:51 buildbot-master117 kernel: [31497] 0 31497 19669 22 0 0 0 master
Jul 21 13:25:51 buildbot-master117 kernel: [31500] 89 31500 19732 18 0 0 0 qmgr
Jul 21 13:25:51 buildbot-master117 kernel: [29614] 497 29614 10248 96 0 0 0 nrpe
Jul 21 13:25:51 buildbot-master117 kernel: [24996] 89 24996 19689 203 0 0 0 pickup
Jul 21 13:25:51 buildbot-master117 kernel: [30559] 0 30559 34004 80 0 0 0 crond
Jul 21 13:25:51 buildbot-master117 kernel: [30560] 0 30560 2321 37 0 0 0 sh
Jul 21 13:25:51 buildbot-master117 kernel: [30575] 0 30575 47115 17055 0 0 0 puppet
Jul 21 13:25:51 buildbot-master117 kernel: [30727] 500 30727 84516 5753 0 0 0 python
Jul 21 13:25:51 buildbot-master117 kernel: [30728] 500 30728 59548 4977 0 0 0 python
Jul 21 13:25:51 buildbot-master117 kernel: [30832] 500 30832 61659 5064 0 0 0 python
Jul 21 13:25:51 buildbot-master117 kernel: [30839] 500 30839 14943 182 0 0 0 ssh
Jul 21 13:25:51 buildbot-master117 kernel: [30840] 0 30840 47118 17248 0 0 0 puppet
Jul 21 13:25:51 buildbot-master117 kernel: Out of memory: Kill process 22024 (buildbot) score 960 or sacrifice child
Jul 21 13:25:51 buildbot-master117 kernel: Killed process 22024, UID 500, (buildbot) total-vm:8507884kB, anon-rss:3532832kB, file-rss:4kB
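For anyone reading the numbers in this log: the process table reports total_vm and rss in pages, while the "Killed process" line reports kB. A rough sketch of the conversion follows, assuming 4 kB pages (the log itself is consistent with that: active_anon:777957 pages corresponds to active_anon:3111828kB). The helper name pages_to_kb is made up for illustration.

```python
# Assumption: 4 kB pages, consistent with 777957 pages == 3111828 kB in the log.
PAGE_KB = 4


def pages_to_kb(pages: int) -> int:
    """Convert a page count from the OOM report into kB."""
    return pages * PAGE_KB


# Values copied from the log above.
ram_kb = pages_to_kb(983039)               # "983039 pages RAM"
buildbot_rss_kb = pages_to_kb(883209)      # rss column for pid 22024 (buildbot)
buildbot_vm_kb = pages_to_kb(2126971)      # total_vm column for pid 22024
killed_rss_kb = 3532832 + 4                # anon-rss + file-rss from "Killed process"

print("RAM (kB):", ram_kb)
print("buildbot rss (kB):", buildbot_rss_kb, "reported at kill:", killed_rss_kb)
print("buildbot share of RAM: %.1f%%" % (100.0 * buildbot_rss_kb / ram_kb))
```

The buildbot process alone was resident in roughly 90% of RAM, and with "Free swap = 0kB" there was nothing left for puppet to fork into, which fits the out-of-memory theory.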
Assignee | ||
Comment 3•10 years ago
Resolution: this instance was an m3.medium, so we:
1) stopped the instance
2) updated it to m3.large
3) started the instance

Based upon:
- this machine has been troublesome before (no bugs for reference)
- we have some ec2 linux64 masters that are m3.medium and some m1.large

we are going to switch the instance types of all m3.medium masters to m3.large, like we did for 117. This will require:
1) graceful shutdown of buildbot
2) disabling in slavealloc
3) stopping instance
4) changing instance type
5) enabling in slavealloc
6) starting instance

This will be done on:
bm113-tests1-linux64
bm114-tests1-linux64
bm115-tests1-linux64
bm116-tests1-linux64
bm118-tests1-linux64
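The EC2 part of the procedure (stop, change type, start) could be scripted along these lines. This is only a sketch under assumptions: it uses a boto3-style EC2 client, which this bug does not mention, and the function name resize_instance and the instance ID in the usage comment are hypothetical. The buildbot graceful shutdown and the slavealloc disable/enable steps are not covered and would still be done by hand.

```python
def resize_instance(ec2, instance_id, new_type="m3.large"):
    """Stop an instance, change its type, and start it again.

    `ec2` is assumed to be a boto3 EC2 client (boto3.client("ec2")).
    The instance type can only be changed while the instance is stopped.
    """
    # Stop the instance and wait until it is fully stopped.
    ec2.stop_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])

    # Change the instance type.
    ec2.modify_instance_attribute(
        InstanceId=instance_id,
        InstanceType={"Value": new_type},
    )

    # Start it again and wait until it is running.
    ec2.start_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])


# Usage (hypothetical instance ID, requires boto3 and AWS credentials):
#   import boto3
#   resize_instance(boto3.client("ec2"), "i-0123456789abcdef0")
```

Passing the client in keeps the function easy to dry-run against a stub before touching a real master.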
Assignee | ||
Comment 4•10 years ago
this work will be done tomorrow now.
Assignee | ||
Updated•10 years ago
Assignee: nobody → jlund
Summary: buildbot-master117 stopped running jobs → upgrade ec2 linux64 test masters from m3.medium to m3.large
Assignee | ||
Comment 5•10 years ago
Due to tree outages this was not completed. I will try again tomorrow, or else Friday.
Assignee | ||
Comment 6•10 years ago
bm115-tests1-linux64 has been upgraded. It took over 2 hours to stop. I ended up forcing it with a 'make stop' once it had gone more than 30 minutes without taking a job. We will continue tomorrow with the rest.
Assignee | ||
Comment 7•10 years ago
This was not completed due to a tree closure and a 3 hour reconfig on Friday. These still need to be done:
bm113-tests1-linux64
bm114-tests1-linux64
bm116-tests1-linux64
bm118-tests1-linux64
Comment 8•10 years ago
Taking advantage of the quiet period over the weekend to do the rest. bm113 and bm116 are disabled in slavealloc and in graceful shutdown.
Comment 9•10 years ago
bm113, bm114, and bm116 are done. Each required a 'make stop': the master thought it was still working on jobs, but no slaves were connected and nothing was claimed in buildbot_schedulers.buildrequests. bm118 has six legit jobs to finish (disabled & graceful shutdown).
Comment 10•10 years ago
bm118 is done too. All finished?
Assignee | ||
Comment 11•10 years ago
Yup. Thanks for taking advantage of the quiet weekend. This will help buildduty a lot in terms of maintenance. I'll update the spreadsheet and put your name beside it.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Updated•6 years ago
Product: Release Engineering → Infrastructure & Operations
Updated•4 years ago
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard