Closed Bug 1041763 Opened 10 years ago Closed 10 years ago

upgrade ec2 linux64 test masters from m3.medium to m3.large

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jlund, Assigned: jlund)


puppet reports show a fork error for bb117

arr suspects the likely cause is that it ran out of memory
this is an ec2 instance set up as a linux test master

state of the machine (/proc/cpuinfo and /proc/meminfo): https://www.irccloud.com/pastebin/QrZXpf1r
Jul 21 13:21:22 buildbot-master117 collectd[1306]: ethstat plugin: No stats available for eth0
Jul 21 13:25:11 buildbot-master117 puppet-agent[30562]: Enabling Puppet.
Jul 21 13:25:51 buildbot-master117 kernel: puppet invoked oom-killer: gfp_mask=0x200da, order=0, oom_adj=0, oom_score_adj=0
Jul 21 13:25:51 buildbot-master117 kernel: puppet cpuset=/ mems_allowed=0
Jul 21 13:25:51 buildbot-master117 kernel: Pid: 30840, comm: puppet Not tainted 2.6.32-220.el6.x86_64 #1
Jul 21 13:25:51 buildbot-master117 kernel: Call Trace:
Jul 21 13:25:51 buildbot-master117 kernel: [<ffffffff810c2cb1>] ? cpuset_print_task_mems_allowed+0x91/0xb0
Jul 21 13:25:51 buildbot-master117 kernel: [<ffffffff81113a30>] ? dump_header+0x90/0x1b0
Jul 21 13:25:51 buildbot-master117 kernel: [<ffffffff81113eba>] ? oom_kill_process+0x8a/0x2c0
Jul 21 13:25:51 buildbot-master117 kernel: [<ffffffff81113df1>] ? select_bad_process+0xe1/0x120
Jul 21 13:25:51 buildbot-master117 kernel: [<ffffffff81114310>] ? out_of_memory+0x220/0x3c0
Jul 21 13:25:51 buildbot-master117 kernel: [<ffffffff8112402e>] ? __alloc_pages_nodemask+0x89e/0x940
Jul 21 13:25:51 buildbot-master117 kernel: [<ffffffff81158c7a>] ? alloc_pages_vma+0x9a/0x150
Jul 21 13:25:51 buildbot-master117 kernel: [<ffffffff8113aa9d>] ? do_wp_page+0xfd/0x8d0
Jul 21 13:25:51 buildbot-master117 kernel: [<ffffffff8113b6b9>] ? __do_fault+0x449/0x510
Jul 21 13:25:51 buildbot-master117 kernel: [<ffffffff81004a49>] ? __raw_callee_save_xen_pmd_val+0x11/0x1e
Jul 21 13:25:51 buildbot-master117 kernel: [<ffffffff8113ba4d>] ? handle_pte_fault+0x2cd/0xb50
Jul 21 13:25:51 buildbot-master117 kernel: [<ffffffff814efa5a>] ? error_exit+0x2a/0x60
Jul 21 13:25:51 buildbot-master117 kernel: [<ffffffff8100ba9d>] ? retint_restore_args+0x5/0x6
Jul 21 13:25:51 buildbot-master117 kernel: [<ffffffff8100122a>] ? hypercall_page+0x22a/0x1010
Jul 21 13:25:51 buildbot-master117 kernel: [<ffffffff8100122a>] ? hypercall_page+0x22a/0x1010
Jul 21 13:25:51 buildbot-master117 kernel: [<ffffffff81004a49>] ? __raw_callee_save_xen_pmd_val+0x11/0x1e
Jul 21 13:25:51 buildbot-master117 kernel: [<ffffffff8113c4b4>] ? handle_mm_fault+0x1e4/0x2b0
Jul 21 13:25:51 buildbot-master117 kernel: [<ffffffff81042b39>] ? __do_page_fault+0x139/0x480
Jul 21 13:25:51 buildbot-master117 kernel: [<ffffffff8100988e>] ? __switch_to+0x26e/0x320
Jul 21 13:25:51 buildbot-master117 kernel: [<ffffffff81055d1f>] ? finish_task_switch+0x4f/0xe0
Jul 21 13:25:51 buildbot-master117 kernel: [<ffffffff814eca20>] ? thread_return+0x4e/0x77e
Jul 21 13:25:51 buildbot-master117 kernel: [<ffffffff810d447c>] ? audit_syscall_entry+0xc/0x2a0
Jul 21 13:25:51 buildbot-master117 kernel: [<ffffffff814f246e>] ? do_page_fault+0x3e/0xa0
Jul 21 13:25:51 buildbot-master117 kernel: [<ffffffff814ef825>] ? page_fault+0x25/0x30
Jul 21 13:25:51 buildbot-master117 kernel: Mem-Info:
Jul 21 13:25:51 buildbot-master117 kernel: Node 0 DMA per-cpu:
Jul 21 13:25:51 buildbot-master117 kernel: CPU    0: hi:    0, btch:   1 usd:   0
Jul 21 13:25:51 buildbot-master117 kernel: Node 0 DMA32 per-cpu:
Jul 21 13:25:51 buildbot-master117 kernel: CPU    0: hi:  186, btch:  31 usd:  30
Jul 21 13:25:51 buildbot-master117 kernel: active_anon:777957 inactive_anon:153842 isolated_anon:32
Jul 21 13:25:51 buildbot-master117 kernel: active_file:2 inactive_file:58 isolated_file:0
Jul 21 13:25:51 buildbot-master117 kernel: unevictable:0 dirty:18 writeback:0 unstable:0
Jul 21 13:25:51 buildbot-master117 kernel: free:4011 slab_reclaimable:2209 slab_unreclaimable:4790
Jul 21 13:25:51 buildbot-master117 kernel: mapped:1 shmem:74 pagetables:5117 bounce:0
Jul 21 13:25:51 buildbot-master117 kernel: Node 0 DMA free:8256kB min:16kB low:20kB high:24kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:8252kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
Jul 21 13:25:51 buildbot-master117 kernel: lowmem_reserve[]: 0 3771 3771 3771
Jul 21 13:25:51 buildbot-master117 kernel: Node 0 DMA32 free:7788kB min:7848kB low:9808kB high:11772kB active_anon:3111828kB inactive_anon:615368kB active_file:8kB inactive_file:232kB unevictable:0kB isolated(anon):128kB isolated(file):0kB present:3862240kB mlocked:0kB dirty:72kB writeback:0kB mapped:4kB shmem:296kB slab_reclaimable:8836kB slab_unreclaimable:19160kB kernel_stack:792kB pagetables:20468kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:160 all_unreclaimable? no
Jul 21 13:25:51 buildbot-master117 kernel: lowmem_reserve[]: 0 0 0 0
Jul 21 13:25:51 buildbot-master117 kernel: Node 0 DMA: 2*4kB 1*8kB 1*16kB 1*32kB 0*64kB 2*128kB 1*256kB 1*512kB 1*1024kB 1*2048kB 1*4096kB = 8256kB
Jul 21 13:25:51 buildbot-master117 kernel: Node 0 DMA32: 423*4kB 0*8kB 1*16kB 2*32kB 0*64kB 1*128kB 1*256kB 1*512kB 1*1024kB 0*2048kB 1*4096kB = 7788kB
Jul 21 13:25:51 buildbot-master117 kernel: 3495 total pagecache pages
Jul 21 13:25:51 buildbot-master117 kernel: 3361 pages in swap cache
Jul 21 13:25:51 buildbot-master117 kernel: Swap cache stats: add 8680784, delete 8677423, find 7779912/8400558
Jul 21 13:25:51 buildbot-master117 kernel: Free swap  = 0kB
Jul 21 13:25:51 buildbot-master117 kernel: Total swap = 4194296kB
Jul 21 13:25:51 buildbot-master117 kernel: 983039 pages RAM
Jul 21 13:25:51 buildbot-master117 kernel: 20652 pages reserved
Jul 21 13:25:51 buildbot-master117 kernel: 10150 pages shared
Jul 21 13:25:51 buildbot-master117 kernel: 945466 pages non-shared
Jul 21 13:25:51 buildbot-master117 kernel: [ pid ]   uid  tgid total_vm      rss cpu oom_adj oom_score_adj name
Jul 21 13:25:51 buildbot-master117 kernel: [  304]     0   304     2712        1   0     -17         -1000 udevd
Jul 21 13:25:51 buildbot-master117 kernel: [  526]     0   526     2711        1   0     -17         -1000 udevd
Jul 21 13:25:51 buildbot-master117 kernel: [  772]     0   772     2293       39   0       0             0 dhclient
Jul 21 13:25:51 buildbot-master117 kernel: [  812]     0   812     1553        2   0       0             0 portreserve
Jul 21 13:25:51 buildbot-master117 kernel: [  819]     0   819    62187       91   0       0             0 rsyslogd
Jul 21 13:25:51 buildbot-master117 kernel: [  844]    32   844     4756       16   0       0             0 rpcbind
Jul 21 13:25:51 buildbot-master117 kernel: [  861]    81   861     5359       41   0       0             0 dbus-daemon
Jul 21 13:25:51 buildbot-master117 kernel: [  885]    68   885     6218       66   0       0             0 hald
Jul 21 13:25:51 buildbot-master117 kernel: [  886]     0   886     4539        2   0       0             0 hald-runner
Jul 21 13:25:51 buildbot-master117 kernel: [ 1022]   500  1022     2314        1   0       0             0 run_pulse_publi
Jul 21 13:25:51 buildbot-master117 kernel: [ 1024]   500  1024    30060     3373   0       0             0 python
Jul 21 13:25:51 buildbot-master117 kernel: [ 1150]    38  1150     7552       34   0       0             0 ntpd
Jul 21 13:25:51 buildbot-master117 kernel: [ 1257]     0  1257     2304       18   0       0             0 abrt-dump-oops
Jul 21 13:25:51 buildbot-master117 kernel: [ 1266]     0  1266    29709        2   0       0             0 abrtd
Jul 21 13:25:51 buildbot-master117 kernel: [ 1274]     0  1274    29312       23   0       0             0 crond
Jul 21 13:25:51 buildbot-master117 kernel: [ 1304]     0  1304     1543        1   0       0             0 collectdmon
Jul 21 13:25:51 buildbot-master117 kernel: [ 1306]     0  1306   168278      159   0       0             0 collectd
Jul 21 13:25:51 buildbot-master117 kernel: [ 1333]     0  1333     1029        2   0       0             0 mingetty
Jul 21 13:25:51 buildbot-master117 kernel: [ 1335]     0  1335     1029        2   0       0             0 mingetty
Jul 21 13:25:51 buildbot-master117 kernel: [ 1337]     0  1337     1032        2   0       0             0 agetty
Jul 21 13:25:51 buildbot-master117 kernel: [ 1339]     0  1339     1029        2   0       0             0 mingetty
Jul 21 13:25:51 buildbot-master117 kernel: [ 1341]     0  1341     1029        2   0       0             0 mingetty
Jul 21 13:25:51 buildbot-master117 kernel: [ 1343]     0  1343     1029        2   0       0             0 mingetty
Jul 21 13:25:51 buildbot-master117 kernel: [ 1345]     0  1345     1029        2   0       0             0 mingetty
Jul 21 13:25:51 buildbot-master117 kernel: [ 2134]     0  2134    16017       36   0       0             0 sshd
Jul 21 13:25:51 buildbot-master117 kernel: [22024]   500 22024  2126971   883209   0       0             0 buildbot
Jul 21 13:25:51 buildbot-master117 kernel: [  532]   500   532    26538        1   0       0             0 run_command_run
Jul 21 13:25:51 buildbot-master117 kernel: [  533]   500   533    49062      537   0       0             0 python
Jul 21 13:25:51 buildbot-master117 kernel: [31497]     0 31497    19669       22   0       0             0 master
Jul 21 13:25:51 buildbot-master117 kernel: [31500]    89 31500    19732       18   0       0             0 qmgr
Jul 21 13:25:51 buildbot-master117 kernel: [29614]   497 29614    10248       96   0       0             0 nrpe
Jul 21 13:25:51 buildbot-master117 kernel: [24996]    89 24996    19689      203   0       0             0 pickup
Jul 21 13:25:51 buildbot-master117 kernel: [30559]     0 30559    34004       80   0       0             0 crond
Jul 21 13:25:51 buildbot-master117 kernel: [30560]     0 30560     2321       37   0       0             0 sh
Jul 21 13:25:51 buildbot-master117 kernel: [30575]     0 30575    47115    17055   0       0             0 puppet
Jul 21 13:25:51 buildbot-master117 kernel: [30727]   500 30727    84516     5753   0       0             0 python
Jul 21 13:25:51 buildbot-master117 kernel: [30728]   500 30728    59548     4977   0       0             0 python
Jul 21 13:25:51 buildbot-master117 kernel: [30832]   500 30832    61659     5064   0       0             0 python
Jul 21 13:25:51 buildbot-master117 kernel: [30839]   500 30839    14943      182   0       0             0 ssh
Jul 21 13:25:51 buildbot-master117 kernel: [30840]     0 30840    47118    17248   0       0             0 puppet
Jul 21 13:25:51 buildbot-master117 kernel: Out of memory: Kill process 22024 (buildbot) score 960 or sacrifice child
Jul 21 13:25:51 buildbot-master117 kernel: Killed process 22024, UID 500, (buildbot) total-vm:8507884kB, anon-rss:3532832kB, file-rss:4kB
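
To read the dump above more easily, here is a minimal sketch that pulls the process table rows out of the kernel log and ranks them by resident memory. It assumes 4 kB pages (the usual x86_64 default, not stated in the log); the names ROW, PAGE_KB and top_rss are made up for illustration. Under that assumption the buildbot rss of 883209 pages is 3532836 kB (about 3.37 GiB), which matches anon-rss plus file-rss in the "Killed process" line, and the 983039 pages of RAM come to about 3.75 GiB.

```python
import re

PAGE_KB = 4  # assumption: 4 kB pages, the x86_64 default

# Matches process-table rows of the OOM dump, e.g.
# "... kernel: [22024]   500 22024  2126971   883209   0       0             0 buildbot"
# Columns: [pid] uid tgid total_vm rss cpu oom_adj oom_score_adj name
ROW = re.compile(
    r"kernel: \[\s*(?P<pid>\d+)\]\s+(?P<uid>\d+)\s+(?P<tgid>\d+)\s+"
    r"(?P<total_vm>\d+)\s+(?P<rss>\d+)\s+\d+\s+-?\d+\s+-?\d+\s+(?P<name>\S+)"
)


def top_rss(log_lines, n=3):
    """Return (name, pid, rss_kb) for the n processes with the largest rss."""
    rows = []
    for line in log_lines:
        m = ROW.search(line)
        if m:  # header, call trace and Mem-Info lines do not match
            rows.append((m["name"], int(m["pid"]), int(m["rss"]) * PAGE_KB))
    return sorted(rows, key=lambda r: r[2], reverse=True)[:n]


# Three rows copied from the log above.
sample = [
    "Jul 21 13:25:51 buildbot-master117 kernel: [22024]   500 22024  2126971   883209   0       0             0 buildbot",
    "Jul 21 13:25:51 buildbot-master117 kernel: [30575]     0 30575    47115    17055   0       0             0 puppet",
    "Jul 21 13:25:51 buildbot-master117 kernel: [ 1306]     0  1306   168278      159   0       0             0 collectd",
]

for name, pid, rss_kb in top_rss(sample):
    print(f"{name:10s} pid={pid:<6d} rss={rss_kb} kB")
```

On this dump the buildbot process alone holds most of the instance's RAM, and swap is fully used ("Free swap = 0kB"), which is consistent with the out-of-memory theory.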
resolution: this instance was an m3.medium, so we 1) stopped the instance, 2) changed its type to m3.large, and 3) started the instance

based upon:
- this machine has been troublesome before (no bugs for reference)
- some of our ec2 linux64 test masters are m3.medium and some are m1.large


we are going to switch the instance types of all m3.medium to m3.large like we did for 117

this will require:
1) graceful shutdown of buildbot
2) disabling in slavealloc
3) stopping instance
4) changing instance type
5) enabling in slavealloc
6) starting instance
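
The EC2 part of the list above (steps 3, 4 and 6) could be sketched roughly as follows. This is not the tooling that was used for this bug; it assumes a boto3 EC2 client is passed in and that the instance is EBS-backed, so it can be stopped and resized. The function name resize_instance and the instance ID in the usage comment are hypothetical. The buildbot graceful shutdown and the slavealloc steps are not modelled here.

```python
def resize_instance(ec2, instance_id, new_type="m3.large"):
    """Stop an instance, change its type, and start it again.

    `ec2` is assumed to be a boto3 EC2 client, e.g. boto3.client("ec2").
    The instance type can only be changed while the instance is stopped.
    """
    # step 3: stop the instance and wait until it is fully stopped
    ec2.stop_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])

    # step 4: change the instance type
    ec2.modify_instance_attribute(
        InstanceId=instance_id, InstanceType={"Value": new_type}
    )

    # step 6: start the instance and wait until it is running
    ec2.start_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])


# Usage (needs boto3 and AWS credentials; the instance ID is a placeholder):
#   import boto3
#   resize_instance(boto3.client("ec2"), "i-0123456789abcdef0")
```

Taking the client as a parameter keeps the function free of credentials handling and lets it be exercised against a stub before touching a real master.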

This will be done on:

bm113-tests1-linux64
bm114-tests1-linux64
bm115-tests1-linux64
bm116-tests1-linux64
bm118-tests1-linux64
this work will now be done tomorrow.
Assignee: nobody → jlund
Summary: buildbot-master117 stopped running jobs → upgrade ec2 linux64 test masters from m3.medium to m3.large
due to tree outages this was not completed. I will try again tomorrow or else on friday
bm115-tests1-linux64 has been upgraded. It took over 2 hours to stop. I ended up forcing it with a 'make stop' once it had not taken a job for >30min.

We will continue tomorrow with the rest
this was not completed due to tree closure and a 3 hour reconfig on friday.

bm113-tests1-linux64
bm114-tests1-linux64
bm116-tests1-linux64
bm118-tests1-linux64

still need to be done.
Taking advantage of the quiet period over the weekend to do the rest. bm113 and bm116 are disabled in slavealloc and in graceful shutdown.
bm113, bm114, and bm116 are done. Each required a 'make stop': the master thought it was still working on jobs, but no slaves were connected and nothing was claimed in buildbot_schedulers.buildrequests.

bm118 has six legit jobs to finish (disabled & graceful shutdown).
bm118 is done too. All finished?
yup. thanks for taking advantage of the quiet weekend.

this will help buildduty a lot in terms of maintenance. I'll update the spreadsheet and put your name beside it.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
See Also: → 1136527
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard