MozMEAO SRE Status Report - December 5, 2017

Here’s what happened on the MozMEAO SRE team from November 14th to December 5th.

Current work

SUMO

Work continues on the SUMO move to AWS. We’ve provisioned a small RDS MySQL instance for development and tried importing a production snapshot. The import took 30 hours on a db.t2.small instance, so we experimented with temporarily scaling the RDS instance up to a db.m4.xlarge. The import is now expected to complete in 5 hours, roughly six times faster.
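As a rough illustration of the temporary scale-up, the sketch below wraps the boto3 RDS `modify_db_instance` call. The client is passed in rather than created here, and the helper names and the instance identifier in the usage comment are hypothetical, not our actual tooling or resource names.

```python
def scale_instance(rds_client, instance_id, instance_class):
    """Ask RDS to change the instance class right away.

    rds_client is expected to behave like boto3.client("rds").
    ApplyImmediately=True skips waiting for the maintenance window.
    """
    return rds_client.modify_db_instance(
        DBInstanceIdentifier=instance_id,
        DBInstanceClass=instance_class,
        ApplyImmediately=True,
    )


def speedup(before_hours, after_hours):
    """Ratio between the old and new import durations."""
    return before_hours / after_hours


# Hypothetical usage (identifier is made up for illustration):
#   import boto3
#   rds = boto3.client("rds")
#   scale_instance(rds, "sumo-dev", "db.m4.xlarge")   # before the import
#   scale_instance(rds, "sumo-dev", "db.t2.small")    # after the import
```

Scaling back down after the import keeps the development instance cheap for day-to-day use.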

We will investigate whether incremental backup/restore is an option for the production transition.

MDN

MDN had several short downtime events in November, caused by heavy load due to scraping. Our K8s liveness and readiness probes often forced pods to restart when MySQL was slow to respond.
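To show why a slow database can turn into pod restarts, here is a small simulation of how a liveness probe counts consecutive failures. It is a simplified model, not the kubelet's implementation, and the latencies and thresholds are made-up numbers rather than MDN's real settings.

```python
def count_restarts(latencies, timeout_seconds, failure_threshold):
    """Count restarts triggered by a series of probe response times.

    A probe fails when its response takes longer than timeout_seconds.
    After failure_threshold consecutive failures the container is
    restarted and the failure counter starts over.
    """
    restarts = 0
    consecutive_failures = 0
    for latency in latencies:
        if latency > timeout_seconds:
            consecutive_failures += 1
            if consecutive_failures >= failure_threshold:
                restarts += 1
                consecutive_failures = 0
        else:
            # Any success resets the run of failures.
            consecutive_failures = 0
    return restarts


# Hypothetical probe latencies (seconds) while MySQL is under load.
slow_period = [0.2, 3.0, 3.5, 4.0, 0.3, 2.5]

strict = count_restarts(slow_period, timeout_seconds=1, failure_threshold=3)
relaxed = count_restarts(slow_period, timeout_seconds=5, failure_threshold=3)
```

In this model the strict settings restart the container once during the slow period, while the longer timeout rides it out. Restarting a pod does nothing to speed up MySQL, which is why loosening the probes can reduce downtime.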

Several readiness and liveness probe changes were issued by @escattone and @jwhitlock to help alleviate the issue.

The November 2017 Kuma report has additional details.

We now have a few load balancer infrastructure tests for MDN, implemented in this pull request.

MDN Interactive-examples

Caching is now more granular: different assets are served with different cache times.
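One way to picture per-asset cache times is a lookup from file extension to a `Cache-Control` value. This is only a sketch: the extensions, durations, and function name below are assumptions for illustration, not the values actually deployed for interactive-examples.

```python
import posixpath

# Hypothetical cache lifetimes in seconds, keyed by file extension.
CACHE_SECONDS = {
    ".html": 300,      # pages change often, keep them short-lived
    ".css": 86400,     # static assets can be cached for a day
    ".js": 86400,
}
DEFAULT_CACHE_SECONDS = 3600


def cache_control_for(path):
    """Return a Cache-Control header value for the given asset path."""
    extension = posixpath.splitext(path)[1].lower()
    max_age = CACHE_SECONDS.get(extension, DEFAULT_CACHE_SECONDS)
    return f"public, max-age={max_age}"
```

For example, `cache_control_for("pages/css/animation.html")` gives a short lifetime, while a `.js` bundle gets the longer one.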

Bedrock

Bedrock is transitioning to a local SQLite DB and a clock process in every container. This removes the dependency on RDS and makes running Bedrock cheaper. In preparation for this change, S3 buckets have been created for dev, stage, and prod.
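The sketch below shows one possible shape for such a clock process, assuming it periodically fetches a fresh database file and swaps it into place. That refresh behavior is our assumption for illustration, since the details are not covered in this report. The function names are hypothetical, and `fetch_bytes` stands in for whatever downloads the file (for example, from one of the S3 buckets).

```python
import os
import sqlite3
import tempfile


def refresh_local_db(fetch_bytes, db_path):
    """Replace db_path with freshly fetched content in one atomic step."""
    data = fetch_bytes()
    directory = os.path.dirname(os.path.abspath(db_path))
    # Write next to the target so os.replace stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
        os.replace(tmp_path, db_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def run_clock(fetch_bytes, db_path, interval_seconds, iterations, sleep):
    """Refresh the local DB a fixed number of times, pausing in between.

    sleep is injected (time.sleep in real use) so the loop is testable.
    """
    for _ in range(iterations):
        refresh_local_db(fetch_bytes, db_path)
        sleep(interval_seconds)


def read_rows(db_path, query):
    """Open the local file read-only and run a query against it."""
    conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
    try:
        return conn.execute(query).fetchall()
    finally:
        conn.close()
```

Writing to a temporary file and renaming it means a reader never sees a half-written database, which matters when the web process and the clock process share the same container.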