Closed Bug 1252248 Opened 8 years ago Closed 8 years ago

Add more tst-linux64, tst-emulator64 capacity in AWS

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: catlee, Assigned: vciobancai)

References

Details

Attachments

(9 files, 3 obsolete files)

50.38 KB, text/plain (kmoir: review+; vciobancai: checked-in+)
26.06 KB, text/plain (kmoir: review+; vciobancai: checked-in+)
703 bytes, patch (kmoir: review+; vciobancai: checked-in+)
53 bytes, text/x-github-pull-request (kmoir: review+)
4.62 KB, patch (kmoir: review+; vciobancai: checked-in+)
1.02 KB, patch (kmoir: review+; vciobancai: checked-in+)
259 bytes, text/plain (kmoir: review+; vciobancai: checked-in+)
1017 bytes, patch (dustin: review+; kmoir: checked-in+)
1.80 KB, patch (aselagea: review+)
We're seeing significant wait times on tst-linux64 and tst-emulator64 platforms. We should add more machines / masters / subnets as appropriate to increase each pool by about 25%. This translates into 500 more tst-linux64 machines and 250 more tst-emulator64 machines.

See also:
https://bugzilla.mozilla.org/show_bug.cgi?id=1090568
https://bugzilla.mozilla.org/show_bug.cgi?id=1143901
https://bugzilla.mozilla.org/show_bug.cgi?id=1204756
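
For context, a quick back-of-envelope check of the numbers above (a sketch only; the current pool sizes are inferred from the 25% figure and are not stated in the bug):

# Rough arithmetic implied by the request: if 500 and 250 extra machines
# correspond to ~25% growth, the current pools are roughly 2000 and 1000.
for pool, extra in [("tst-linux64", 500), ("tst-emulator64", 250)]:
    implied_current = int(extra / 0.25)
    print("%s: +%d machines implies a current pool of ~%d" % (pool, extra, implied_current))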
Assignee: nobody → vlad.ciobancai
Attached file bug1252248_tst-linux64_emulator.csv (obsolete) —
Attached is the CSV file for tst-linux64: I added 250 spot instances in us-east-1 and the rest in us-west-2.
Attachment #8725176 - Flags: review?(kmoir)
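
As an aside, a minimal sketch of how names for the 500 new tst-linux64 spot instances might be generated and split 250/250 across the two regions; the naming convention, numbering range, and the slavealloc CSV columns are assumptions, not taken from the attached file:

# Illustrative only: produce candidate slave names for the new capacity.
def spot_names(prefix, start, count):
    return ["%s-%04d" % (prefix, n) for n in range(start, start + count)]

use1 = spot_names("tst-linux64-spot", 1000, 250)  # assumed us-east-1 block
usw2 = spot_names("tst-linux64-spot", 1250, 250)  # assumed us-west-2 block
print(len(use1) + len(usw2), "names:", use1[0], "...", usw2[-1])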
Created the following pull request, but it failed: https://github.com/mozilla/build-cloud-tools/pull/181. Can somebody help me figure out why? The change updates watch_pending.cfg to increase the tst-linux64 capacity.
Updated the csv file
Attachment #8725176 - Attachment is obsolete: true
Attachment #8725176 - Flags: review?(kmoir)
Attachment #8725186 - Flags: review?(kmoir)
Attached is the CSV file for tst-emulator64-spot: I added 125 spot instances in us-east-1 and the rest in us-west-2.
Attachment #8725187 - Flags: review?(kmoir)
Attached is the updated production_config.py for both slave types (tst-emulator64 and tst-linux64).
Attachment #8725189 - Flags: review?(kmoir)
For the cloud-tools GitHub pull request, there seems to be an issue with travis/tox that is causing the failure. If I clone your repo into my local docker instance and run tox, the tests pass. I'm not sure what is happening there; I'm looking into it. As for that pull request, you will need to add additional subnets to account for the increase in the number of machines we are going to add, so we don't run out of IP addresses. Bug 1165432 has an example of this change. You will also need to look at the master load for the existing masters that serve these instance types and determine if we need to add more masters, since we are significantly increasing the number of instances that will attach to them.
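
To make the IP-capacity point above concrete, a rough sketch (not from the bug) of how many /24 subnets the extra instances need per region, assuming AWS reserves 5 addresses per subnet and an even split across the two regions:

# Ceiling division of new instances over usable addresses per /24 (256 - 5).
AWS_RESERVED_PER_SUBNET = 5

def subnets_needed(instances, prefix_len=24):
    usable = 2 ** (32 - prefix_len) - AWS_RESERVED_PER_SUBNET  # 251 for a /24
    return -(-instances // usable)  # ceiling division

for pool, extra in [("tst-linux64", 500), ("tst-emulator64", 250)]:
    per_region = extra // 2
    print("%s: at least %d new /24 per region (plus headroom)"
          % (pool, subnets_needed(per_region)))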
Attachment #8725189 - Flags: review?(kmoir) → review+
Attachment #8725187 - Flags: review?(kmoir) → review+
Attachment #8725186 - Flags: review?(kmoir) → review+
Created the following pull request to add the new subnets: https://github.com/mozilla/build-cloud-tools/pull/182
You also need to update configs/tst-linux64 and configs/tst-emulator64 to include the new subnets as appropriate

see 
https://bugzilla.mozilla.org/show_bug.cgi?id=1165432#c1
(In reply to Kim Moir [:kmoir] from comment #8)
> You also need to update configs/tst-linux64 and configs/tst-emulator64 to
> include the new subnets as appropriate
> 
> see 
> https://bugzilla.mozilla.org/show_bug.cgi?id=1165432#c1

From what I understood, subnets.yml needs to be updated first, and after that the script needs to be run.

First, I wanted to be sure that the new entries I added in subnets.yml are OK.
@Kim, please let me know if the entries are OK so I can run the script.
Flags: needinfo?(kmoir)
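
Before running the script, a local sanity check of the new subnets.yml entries can catch typos early. A minimal sketch, assuming the file maps each region to a list of subnet CIDRs (the real layout in build-cloud-tools may differ):

# Illustrative pre-flight check: parse subnets.yml, validate each CIDR,
# and flag overlapping ranges within a region.
import itertools
import ipaddress
import yaml  # PyYAML

with open("configs/subnets.yml") as f:
    data = yaml.safe_load(f)

for region, cidrs in data.items():  # assumed layout: region -> [CIDR, ...]
    nets = [ipaddress.ip_network(c) for c in cidrs]  # raises on invalid CIDRs
    for a, b in itertools.combinations(nets, 2):
        if a.overlaps(b):
            print("overlap in %s: %s <-> %s" % (region, a, b))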
Answered questions on the GitHub pull request.
Flags: needinfo?(kmoir)
The following subnets were created in us-west-2 without any issue: subnet-47a8b830, subnet-797f5120, subnet-330bf457 and subnet-5fa8b828. But when the script tried to create the subnets in us-east-1, I received the following error:

2016-03-03 00:12:37,486 - 10.132.60.0/22 - IPSet(['10.132.60.0/22']) isn't covered by any subnets
2016-03-03 00:12:37,487 - creating subnet 10.132.63.0/24 in us-east-1a/vpc-b42100df
(y/N) y
2016-03-03 00:12:48,157 - creating subnet
2016-03-03 00:12:48,306 - 400 Bad Request
2016-03-03 00:12:48,306 - <?xml version="1.0" encoding="UTF-8"?>
<Response><Errors><Error><Code>InvalidSubnet.Range</Code><Message>The CIDR '10.132.63.0/24' is invalid.</Message></Error></Errors><RequestID>8cc8c5da-a3f2-42e5-8d42-1d38321034c6</RequestID></Response>
Traceback (most recent call last):
  File "scripts/aws_manage_subnets.py", line 121, in <module>
    main()
  File "scripts/aws_manage_subnets.py", line 117, in main
    sync_subnets(conn, config[region])
  File "scripts/aws_manage_subnets.py", line 94, in sync_subnets
    s = conn.create_subnet(vpc_id, c, z.name)
  File "/builds/aws_manager/lib/python2.7/site-packages/boto/vpc/__init__.py", line 1166, in create_subnet
    return self.get_object('CreateSubnet', params, Subnet)
  File "/builds/aws_manager/lib/python2.7/site-packages/boto/connection.py", line 1177, in get_object
    raise self.ResponseError(response.status, response.reason, body)
boto.exception.EC2ResponseError: EC2ResponseError: 400 Bad Request
<?xml version="1.0" encoding="UTF-8"?>
<Response><Errors><Error><Code>InvalidSubnet.Range</Code><Message>The CIDR '10.132.63.0/24' is invalid.</Message></Error></Errors><RequestID>8cc8c5da-a3f2-42e5-8d42-1d38321034c6</RequestID></Response>
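
The 400 InvalidSubnet.Range from EC2 usually means the requested CIDR does not fall inside the VPC's CIDR block. A quick check along these lines reproduces it; the 10.134.0.0/16 VPC range used here is an assumption based on the fix in the later comments, not a value read from vpc-b42100df:

# Illustrative containment check; requires Python 3.7+ for subnet_of().
import ipaddress

vpc_cidr = ipaddress.ip_network("10.134.0.0/16")  # assumed us-east-1 VPC range
for candidate in ("10.132.63.0/24", "10.134.63.0/24"):
    subnet = ipaddress.ip_network(candidate)
    print(candidate, "inside VPC:", subnet.subnet_of(vpc_cidr))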
Found the issue and created another pull request to resolve it: https://github.com/mozilla/build-cloud-tools/pull/184
Created the following subnets 
- us-east-1 : subnet-2e49df58, subnet-68882f42, subnet-8dd818e8 and subnet-519b3209
- us-west-2 : subnet-47a8b830, subnet-797f5120, subnet-330bf457 and subnet-5fa8b828
Attachment #8726233 - Flags: review?(kmoir)
Attachment #8726233 - Flags: review?(kmoir) → review+
Attachment #8725189 - Flags: checked-in+
Attachment #8725186 - Flags: checked-in+
Attachment #8725187 - Flags: checked-in+
Slaves added to the slavealloc DB.
Attached is the production masters JSON file with the two new buildbot masters.
Attachment #8726262 - Flags: review?(kmoir)
Attached is the moco-nodes file with the two new buildbot masters.
Attachment #8726263 - Flags: review?(kmoir)
Attachment #8726263 - Flags: review?(kmoir) → review+
Attachment #8726262 - Flags: review?(kmoir) → review+
Attachment #8726262 - Flags: checked-in+
Attachment #8726263 - Flags: checked-in+
Created two new buildbot masters:
- buildbot-master130.bb.releng.use1.mozilla.com
- buildbot-master131.bb.releng.usw2.mozilla.com

Both of them have been added to inventory.
Attached is the CSV for both buildbot masters, to be inserted into the slavealloc DB.
Attachment #8726644 - Flags: review?(kmoir)
At this step, https://wiki.mozilla.org/ReleaseEngineering/How_To/Setup_buildbot_masters_in_AWS#IT, a bug needs to be created to add both masters to Nagios. The example is deprecated and I was not able to find the component in which to file the bug.

Amy, can you please help me with the details needed to create the new bug?
Flags: needinfo?(arich)
Attachment #8726644 - Flags: review?(kmoir) → review+
Attachment #8726644 - Flags: checked-in+
Depends on: 1253601
Flags: needinfo?(arich)
(In reply to Vlad Ciobancai [:vladC] from comment #20)
> At this step
> https://wiki.mozilla.org/ReleaseEngineering/How_To/
> Setup_buildbot_masters_in_AWS#IT a bug needs to be created to add both
> masters in nagios. The example is deprecated and I was not able to find the
> component to create the bug.
> 
> Amy can you please help me with the details in order to create a new bug ?

kmoir helped with a recent example, bug 1207411. I also updated the wiki page.
I reverted the subnet changes because netops changes are needed before devices in the new subnets can connect to our scl3 infrastructure.

https://github.com/mozilla/build-cloud-tools/pull/187

from #releng
2:20 PM <catlee> philor, kmoir: pretty sure now that instances in the new subnets can't talk to all the things they need to in scl3
2:21 PM <dustin> catlee: yep
2:21 PM <catlee> https://irccloud.mozilla.com/pastebin/0kV1V7Aj
2:21 PM <catlee> dustin: is that a quick change, or no? should we backout cloud-tools for now?
2:21 PM <kmoir> huh I didn't see that we needed that change in the older bugs for adding new subnets
2:22 PM <dustin> it's not quick
2:22 PM <dustin> bug for netops, update SG's
2:22 PM <kmoir> I can backout new subnets in cloud tools
2:23 PM <catlee> thanks
2:23 PM <catlee> and I guess kill off instances in those regions
2:23 PM <catlee> I can try that
2:24 PM <dustin> kmoir: is this just adding more test subnets?
2:24 PM <dustin> or are they different than other subnets?
2:25 PM <kmoir> just new test subnets
2:28 PM <dustin> ok, so the firewall-tests change should just be in network.py to add the new subnets
2:29 PM <dustin> huh, I appear not to have fw access anymore
2:29 PM <dustin> anyway, if you file a netops bug asking them to add <new IP ranges> to the address sets containing <old IP ranges> on both fw1.scl3 and fw1.releng.scl3, that should do the trick
2:30 PM <kmoir> okay
2:30 PM <dustin> you'll also need to add the new subnets to https://github.com/mozilla/build-cloud-tools/blob/master/configs/securitygroups.yml
2:30 PM <dustin> and commit that
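
Following dustin's suggestion, a hedged sketch that checks whether the new ranges are referenced somewhere in configs/securitygroups.yml before filing the netops bug. It walks the parsed YAML generically, since the file's exact structure is not shown in this bug, and the CIDRs listed are examples from earlier comments:

# Illustrative: confirm the new subnet CIDRs show up in securitygroups.yml.
import yaml  # PyYAML

NEW_CIDRS = ["10.134.60.0/22", "10.132.60.0/22"]  # example ranges from this bug

def walk(node):
    if isinstance(node, dict):
        for value in node.values():
            yield from walk(value)
    elif isinstance(node, list):
        for value in node:
            yield from walk(value)
    else:
        yield node

with open("configs/securitygroups.yml") as f:
    values = {str(v) for v in walk(yaml.safe_load(f))}

for cidr in NEW_CIDRS:
    print(cidr, "referenced:", cidr in values)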
Vladc: looking at the changes I reverted, I'm unsure why we changed an ip range instead of adding one

ie.

replacing 10.132.60.0/22 with 10.134.60.0/22:

https://github.com/mozilla/build-cloud-tools/commit/88e7a12efa5994e4c6d3b846aaed55d72648e325

shouldn't we be adding new ip ranges?
Flags: needinfo?(vlad.ciobancai)
I started a doc that describes the steps to add new subnets, because I couldn't find one and it seems to be something that has tripped us up since it occurs rarely.

https://wiki.mozilla.org/ReleaseEngineering/How_To/Add_New_AWS_Subnets

Please update with the final steps as needed.
(In reply to Kim Moir [:kmoir] from comment #23)
> Vladc: looking at the changes I reverted, I'm unsure why we changed an ip
> range instead of adding one
> 
> ie.
> 
> replacing 10.132.60.0/22 with 10.134.60.0/22:
> 
> https://github.com/mozilla/build-cloud-tools/commit/
> 88e7a12efa5994e4c6d3b846aaed55d72648e325
> 
> shouldn't we be adding new ip ranges?

Kim, when I created this change, https://github.com/mozilla/build-cloud-tools/pull/183, I made a mistake in subnets.yml by adding the CIDR 10.132.60.0/22 in us-east-1. That IP range is wrong because us-east-1 uses the 10.134.x.x range.
To resolve the issue I created another pull request, https://github.com/mozilla/build-cloud-tools/pull/184, which changes 10.132.60.0/22 to 10.134.60.0/22. For more details please see comments #12 and #13.
Flags: needinfo?(vlad.ciobancai)
Attached file bug1242248firewall-tests.py (obsolete) —
patch for firewall-tests once netops bug is resolved
I tested the patch that Kim wrote on fwunit1, and the tests are still failing.
Created the following pull request to add the new subnets to the security groups file: https://github.com/mozilla/build-cloud-tools/pull/189
Merged the pull request.
I ran the tests again and they are still failing.
The issue was that the security groups had not been updated. I ran the script to create the security groups, Dustin manually updated the firewalls, and the test that Kim wrote then worked without any issue. I also updated the wiki page with the steps.
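
For future reference, one way to double-check that the regenerated security groups actually allow the new ranges is to query EC2 directly. A sketch with boto3; the group-name filter value is a placeholder, not the real releng group name:

# Illustrative: list the CIDRs currently allowed by a security group and
# check whether the new test-subnet ranges are present.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
resp = ec2.describe_security_groups(
    Filters=[{"Name": "group-name", "Values": ["tests"]}]  # placeholder name
)
for group in resp["SecurityGroups"]:
    allowed = {
        ip_range["CidrIp"]
        for perm in group["IpPermissions"]
        for ip_range in perm.get("IpRanges", [])
    }
    print(group["GroupName"], sorted(allowed))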
Created this pull request to add the new subnets for tst-linux64 and tst-emulator64: https://github.com/mozilla/build-cloud-tools/pull/193. This pull request will be pushed to production only when the masters are OK.
Attachment #8727546 - Attachment is obsolete: true
Attachment #8729197 - Flags: review?(dustin)
Attachment #8729197 - Flags: review?(dustin) → review+
Attachment #8729197 - Flags: checked-in+
Attached patch bug1252248tools.patch (obsolete) — Splinter Review
patch to enable new masters
Attachment #8730699 - Flags: review?(alin.selagea)
Comment on attachment 8730699 [details] [diff] [review]
bug1252248tools.patch

Looks good. Noticed that you disabled 'linux64-tsan' for bm130 but left it unchanged for bm131. Is this intended?
Attachment #8730699 - Flags: review?(alin.selagea) → review+
No, that was not my intention. My Eclipse editor had unsaved changes, so the other change did not make it into the patch.
Attachment #8730699 - Attachment is obsolete: true
Attachment #8730730 - Flags: review?(alin.selagea)
Attachment #8730730 - Flags: review?(alin.selagea) → review+
Attachment #8730699 - Flags: checked-in+
I've enabled the two new masters and merged the patch with the new subnets. If this looks okay, we can enable the new machines in slavealloc.
The new machines are now enabled in slavealloc.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard