MozMEAO SRE Status Report - 5/16/2017

Here’s what happened on the MozMEAO SRE team from May 9th - May 16th.

Current work

Bedrock (mozilla.org)

Work continues on moving Bedrock to our Kubernetes infrastructure.

Postgres/RDS provisioning

A Postgres RDS instance has already been provisioned in us-east-1 for our Virginia cluster, and another was created in ap-northeast-1 to support the Tokyo cluster. Additionally, development, staging, and production databases were created in each region. This process was documented here.

Elastic Load Balancer (ELB) provisioning

We’ve automated the creation of ELB’s for Bedrock in Virginia and Tokyo. There are still a few more wrinkles to sort out, but the infra is mostly in place to begin The Big Move to Kubernetes.

MDN

Work continues to analyze the Apache httpd configuration from the current SCL3 datacenter config.

Downtime incident 2017-05-13

On May 13th, 2017 22:49 -22:55, New Relic reported that MDN was unavailable. The site was slow to respond to page views, and was running long database queries. Log analysis show a security scan of our database-intensive endpoints.

On May 14th, 2017, there were high I/O alerts on 3 of the 6 production web servers. This was not reflected in high traffic or a decrease in responsiveness.

Basket

The FxA team would like to send events (FXA_IDs) to Basket and Salesforce, and needed SQS queues in order to move forward. We automated the provisioning of dev/stage/prod SQS queues, and passed off credentials to the appropriate engineers.

The FxA team requested cross AWS account access to the new SQS queues. Access has been automated and granted via this PR.

Snippets

Snippets Stats Collection Issues 2017-04-10

A planned configuration change to add a Route 53 Traffic Policy for the snippets stats collection service caused a day’s worth of data to not be collected due to a SSL certificate error.

Careers

Autoscaling

In order to take advantage of Kubernetes cluster and pod autoscaling (which we’ve documented here), app memory and CPU limits were set for careers.mozilla.org in our Virginia and Tokyo clusters. This allows the careers site to scale up and down based on load.

Acceptance tests

Giorgos Logiotatidis added acceptance tests, which contains a simple bash script and additional Jenkinsfile stages to check if careers.mozilla.org pages return valid responses after deployment.

Downtime incident 2017-04-11

A typo was merged and pushed to production and caused a couple of minutes of downtime before we rolled-back to the previous version.

Decommission openwebdevice.org status

openwebdevice.org will remain operational in http-only mode until the board approves decommissioning. A timeline is unavailable.

Future work

Nucleus

We’re planning to move nucleus to Kubernetes, and then proceed to decommissioning current nucleus infra.

Basket

We’re planning to move basket to Kubernetes shortly after the nucleus migration, and then proceed to decommissioning existing infra.