MozMEAO SRE Status Report - 5/16/2017
Here’s what happened on the MozMEAO SRE team from May 9th - May 16th.
Current work
Bedrock (mozilla.org)
Work continues on moving Bedrock to our Kubernetes infrastructure.
Postgres/RDS provisioning
A Postgres RDS instance has already been provisioned in us-east-1
for our Virginia cluster, and another was created in ap-northeast-1
to support the Tokyo cluster. Additionally, development, staging, and production databases were created in each region. This process was documented here.
Elastic Load Balancer (ELB) provisioning
We’ve automated the creation of ELBs for Bedrock in Virginia and Tokyo. There are still a few more wrinkles to sort out, but the infra is mostly in place to begin The Big Move to Kubernetes.
MDN
Work continues to analyze the Apache httpd configuration from the current SCL3 datacenter config.
- The last remaining rewrites have been implemented.
- John Whitlock reviewed httpd icon configuration and accepted locales and determined that no migration is needed.
- More work is needed to ensure that mime types are correct and consistent in Django.
- httpd Alias directives need to be implemented in Django.
- Ryan Johnson has submitted a PR for liveness/readiness endpoints.
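To illustrate the difference between the two kinds of endpoints, here is a minimal, framework-agnostic sketch. It is not the code from the PR: the function names and the `check_database` callable are hypothetical, and a real Django view would wrap these results in an HTTP response.

```python
from typing import Callable, Tuple


def liveness() -> Tuple[int, str]:
    """Report that the process is up and can answer a request at all."""
    return 200, "OK"


def readiness(check_database: Callable[[], None]) -> Tuple[int, str]:
    """Report ready only when the database dependency responds.

    check_database is a hypothetical callable that raises on failure.
    """
    try:
        check_database()
    except Exception as exc:
        # 503 tells the orchestrator to stop routing traffic to this pod
        return 503, f"database unavailable: {exc}"
    return 200, "OK"
```

Kubernetes restarts a container that fails its liveness probe, but only removes it from service while its readiness probe fails, which is why the dependency check belongs in the readiness endpoint.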
Downtime incident 2017-05-13
On May 13th, 2017, from 22:49 to 22:55, New Relic reported that MDN was unavailable. The site was slow to respond to page views, and was running long database queries. Log analysis showed a security scan of our database-intensive endpoints.
On May 14th, 2017, there were high I/O alerts on 3 of the 6 production web servers. This was not reflected in high traffic or a decrease in responsiveness.
Basket
The FxA team would like to send events (FXA_IDs) to Basket and Salesforce, and needed SQS queues in order to move forward. We automated the provisioning of dev/stage/prod SQS queues, and passed off credentials to the appropriate engineers.
The FxA team requested cross AWS account access to the new SQS queues. Access has been automated and granted via this PR.
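As a rough illustration of what cross-account access to a queue involves, the sketch below builds an SQS queue policy that names another AWS account as principal. It is not taken from the PR: the function name, the default `sqs:SendMessage` action, and the ARN and account ID in the usage note are assumptions for illustration only.

```python
import json


def cross_account_queue_policy(queue_arn, trusted_account_id,
                               actions=("sqs:SendMessage",)):
    """Build a queue policy document that lets another account use the queue.

    The action list is an assumption; grant only what the other team needs.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "CrossAccountAccess",
                "Effect": "Allow",
                # The account root principal delegates to IAM in that account
                "Principal": {"AWS": f"arn:aws:iam::{trusted_account_id}:root"},
                "Action": list(actions),
                "Resource": queue_arn,
            }
        ],
    }
```

For example, `json.dumps(cross_account_queue_policy("arn:aws:sqs:us-east-1:111111111111:example-queue", "222222222222"))` produces a document that could be set as the queue's `Policy` attribute; both identifiers here are placeholders.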
Snippets
Snippets Stats Collection Issues 2017-04-10
A planned configuration change to add a Route 53 Traffic Policy for the snippets stats collection service caused a day’s worth of data to not be collected due to an SSL certificate error.
Careers
Autoscaling
In order to take advantage of Kubernetes cluster and pod autoscaling (which we’ve documented here), app memory and CPU limits were set for careers.mozilla.org in our Virginia and Tokyo clusters. This allows the careers site to scale up and down based on load.
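For a sense of how pod autoscaling reacts to load, the sketch below applies the scaling rule documented for the Kubernetes horizontal pod autoscaler: the desired replica count is the current count scaled by the ratio of observed to target utilization, rounded up. The function name and the replica bounds are illustrative assumptions rather than our actual settings, and the real autoscaler also applies a tolerance band that is omitted here.

```python
import math


def desired_replicas(current_replicas, current_utilization, target_utilization,
                     min_replicas=1, max_replicas=10):
    """Estimate the replica count a horizontal pod autoscaler would request.

    Utilization values are percentages; the min/max bounds are placeholders.
    """
    raw = math.ceil(current_replicas * current_utilization / target_utilization)
    # Clamp to the configured bounds
    return max(min_replicas, min(max_replicas, raw))
```

With two pods running at 90% CPU against a 60% target, `desired_replicas(2, 90, 60)` asks for three pods; when load drops to 30% on four pods, `desired_replicas(4, 30, 60)` scales back to two.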
Acceptance tests
Giorgos Logiotatidis added acceptance tests, which consist of a simple bash script and additional Jenkinsfile stages that check whether careers.mozilla.org pages return valid responses after deployment.
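The actual tests are a bash script; the Python sketch below only illustrates the same idea of requesting a list of pages after a deploy and flagging any that do not return HTTP 200. The function names are hypothetical, and treating only 200 as a valid response is an assumption.

```python
import urllib.error
import urllib.request


def fetch_status(url, timeout=10):
    """Return the HTTP status code for url, or None if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses still carry a status code
        return err.code
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout, and so on
        return None


def failing_pages(urls, fetch=fetch_status):
    """Return the URLs that did not answer with HTTP 200."""
    return [url for url in urls if fetch(url) != 200]
```

A deployment stage could call `failing_pages(["https://careers.mozilla.org/"])` and fail the build if the returned list is not empty. The `fetch` parameter exists so the check can be exercised without network access.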
Downtime incident 2017-04-11
A typo was merged and pushed to production, causing a couple of minutes of downtime before we rolled back to the previous version.
Decommission openwebdevice.org status
openwebdevice.org will remain operational in http-only mode until the board approves decommissioning. A timeline is unavailable.
Future work
Nucleus
We’re planning to move nucleus to Kubernetes, and then decommission the current nucleus infra.
Basket
We’re planning to move basket to Kubernetes shortly after the nucleus migration, and then decommission the existing infra.