MozMEAO SRE Status Report - December 5, 2017
Here’s what happened on the MozMEAO SRE team from November 14th - December 5th.
Work continues on the SUMO move to AWS. We’ve provisioned a small RDS MySQL instance in AWS for development and tried importing a production snapshot. The import took 30 hours on a db.t2.small instance, so we experimented with temporarily scaling the RDS instance to a db.m4.xlarge. The import is now expected to complete in 5 hours.
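For readers who want to script the same kind of temporary resize, here is a minimal sketch. It is not the tooling we used. It assumes a boto3 RDS client is passed in, and the helper name `scale_rds_instance` and the instance identifier `sumo-dev` are hypothetical.

```python
def scale_rds_instance(rds_client, instance_id, instance_class):
    """Ask RDS to change the instance class of one database instance.

    rds_client is expected to behave like boto3.client("rds").
    """
    return rds_client.modify_db_instance(
        DBInstanceIdentifier=instance_id,
        DBInstanceClass=instance_class,
        # Without this flag the change waits for the next maintenance window.
        ApplyImmediately=True,
    )


# Usage (requires boto3 and AWS credentials; "sumo-dev" is a made-up name):
#   import boto3
#   rds = boto3.client("rds")
#   scale_rds_instance(rds, "sumo-dev", "db.m4.xlarge")   # before the import
#   scale_rds_instance(rds, "sumo-dev", "db.t2.small")    # after the import
```

Passing the client in keeps the helper easy to exercise with a stand-in object, and scaling back down afterwards is just a second call with the smaller class.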
We will investigate whether incremental backup/restore is an option for the production transition.
MDN had several short downtime events in November, caused by heavy load from scraping. Our K8s liveness and readiness probes often forced pods to restart when MySQL was slow to respond.
Several readiness and liveness probe changes were issued by @escattone and @jwhitlock to help alleviate the issue:
- eliminate DB dependency from liveness & readiness endpoints in Kuma
- adjust liveness tests, avoid DB query in middleware
- increase threshold of failure for liveness/readiness probes
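The idea behind these changes can be sketched roughly as follows. This is an illustration only, not Kuma's actual code: the WSGI app, the paths `/healthz` and `/readiness`, and the helper `seconds_until_restart` are assumptions made for the example.

```python
def health_app(environ, start_response):
    """Tiny WSGI app that answers probe requests without touching the database."""
    path = environ.get("PATH_INFO", "")
    if path in ("/healthz", "/readiness"):
        # No DB query here: a slow MySQL should not make the pod look dead.
        start_response("204 No Content", [("Content-Length", "0")])
        return [b""]
    start_response("404 Not Found", [("Content-Type", "text/plain")])
    return [b"not found"]


def seconds_until_restart(period_seconds, failure_threshold):
    """Rough time a container can keep failing its liveness probe before a restart.

    Kubernetes restarts the container after failure_threshold consecutive
    failures, checked every period_seconds (probe timeouts are ignored here).
    """
    return period_seconds * failure_threshold
```

With a 10 second period, for example, raising the failure threshold from 3 to 6 roughly doubles the tolerated window from 30 to 60 seconds, which gives a briefly overloaded pod more time to recover before it is restarted.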
The November 2017 Kuma report has additional details.
We now have a few load balancer infrastructure tests for MDN, implemented in this pull request.
Caching is now more granular: different assets are served with different cache times.
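As a rough illustration of per-asset cache times, the sketch below picks a `Cache-Control` header from the file extension. The extensions, the durations, and the function name `cache_control_for` are all hypothetical and do not reflect the values we actually deployed.

```python
import posixpath

# Hypothetical cache lifetimes in seconds, keyed by file extension.
CACHE_SECONDS = {
    ".css": 31536000,  # one year for static stylesheets
    ".js": 31536000,   # one year for static scripts
    ".png": 86400,     # one day for images
    ".html": 300,      # five minutes for pages
}
DEFAULT_SECONDS = 60   # fallback for anything not listed above


def cache_control_for(path):
    """Return a Cache-Control header value based on the path's extension."""
    ext = posixpath.splitext(path)[1].lower()
    return "public, max-age=%d" % CACHE_SECONDS.get(ext, DEFAULT_SECONDS)
```

A lookup table like this keeps the policy in one place, so changing the lifetime for one asset type does not affect the others.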
Bedrock is transitioning to a local SQLite DB and clock process in every container. This removes the dependency on RDS and makes running Bedrock cheaper. In preparation for this change, S3 buckets have been created for dev, stage, and prod.