MozMEAO SRE Status Report - December 5, 2017

Here’s what happened on the MozMEAO SRE team from November 14th - December 5th.

Current work

SUMO

Work continues on the SUMO move to AWS. We’ve provisioned a small RDS MySQL instance in AWS for development and tried importing a production snapshot. The import took 30 hours on a db.t2.small instance, so we experimented with temporarily scaling the RDS instance to a db.m4.xlarge. The import is now expected to complete in about 5 hours.
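
Scaling an RDS instance up for a bulk import and back down afterwards is a routine operation. As a rough sketch only (the instance identifier below is hypothetical, and this can just as easily be done from the AWS console or CLI), it looks like this with boto3:

```python
# Hypothetical sketch: temporarily bump an RDS instance class for a long
# import, then return it to the cheaper class. The identifier is made up.
import boto3

rds = boto3.client("rds")

def set_instance_class(identifier: str, instance_class: str) -> None:
    """Request an instance class change and apply it immediately."""
    rds.modify_db_instance(
        DBInstanceIdentifier=identifier,
        DBInstanceClass=instance_class,
        ApplyImmediately=True,
    )

set_instance_class("sumo-dev", "db.m4.xlarge")   # scale up before the import
# ... run the snapshot import ...
set_instance_class("sumo-dev", "db.t2.small")    # scale back down afterwards
```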

We will investigate if incremental backup/restore is an option for the production transition.

MDN

MDN had several short downtime events in November, caused by heavy load due to scraping. Our K8s liveness and readiness probes often forced pods to restart when MySQL was slow to respond.

Several readiness and liveness probe changes were made by @escattone and @jwhitlock to help alleviate the issue.

The November 2017 Kuma report has additional details.

We now have a few load balancer infrastructure tests for MDN, implemented in this pull request.

MDN Interactive-examples

Caching is now more granular: different types of assets are given different cache times.

Bedrock

Bedrock is transitioning to a local SQLite DB and clock process in every container. This removes the dependency on RDS and makes running Bedrock cheaper. In preparation for this change, S3 buckets have been created for dev, stage, and prod.
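
As a rough illustration of the database half of this change (paths and names here are hypothetical, not Bedrock’s actual settings), a container-local SQLite configuration in Django looks like this:

```python
# Illustrative only: each container reads and writes its own local SQLite
# file instead of a shared RDS database. The clock process in each container
# keeps the data fresh, and the new S3 buckets presumably hold the published
# database copies for dev/stage/prod.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.sqlite3",
        "NAME": "/app/data/bedrock.db",  # hypothetical path inside the container
    }
}
```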

Kuma Report, November 2017

Here’s what happened in November in Kuma, the engine of MDN Web Docs:

  • Shipped the First Interactive Examples
  • Continued Migration of Browser Compatibility Data
  • Sticky Table of Contents and Other Article Improvements
  • Improved MDN in AWS and Kubernetes
  • Shipped Tweaks and Fixes

We’re planning on more of the same for December.

Done in November

Shipped the First Interactive Examples

We’ve launched the new interactive examples on 20+ pages. Try them out on the pages for the CSS property box-shadow and the JavaScript method Array.slice.

box-shadow example

We’re monitoring the page load impact of this limited rollout, and if the results are good, we have another 400 examples ready to go, thanks to Mark Boas and others. Mark also added a JavaScript Interactive Examples Contributing Guide, so that contributors can create even more.

We want the examples to be as fast as possible. Schalk Neethling improved the page load speed of the <iframe> by using preload URLs (PR 4537). Stephanie Hobson and Schalk dived into HTTP/2, and identified require.js as a potential issue for this protocol (Kuma PR 4521 and Interactive Examples PR 329). Josh Mize added appropriate caching headers for the examples and static assets (PR 326).

For the next level of speed gains, we’ll need to speed up the MDN pages themselves. One possibility is to serve developer.mozilla.org from a CDN, which will require big changes to make pages more cacheable. One issue is waffle flags, which allow us to experiment with per-user changes, at the cost of making pages uncacheable. Schalk has made steady progress in eliminating inactive waffle flag experiments, and this work will continue into December.
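
To see why waffle flags and caching conflict, here’s a hedged sketch (flag and template names are hypothetical, not Kuma’s actual code) of how a per-user flag typically shows up in a Django view:

```python
# Illustrative only. Because the flag is evaluated per request/user, two
# users can get different HTML for the same URL, so the response can't be
# cached and shared (for example by a CDN) without extra care.
import waffle
from django.shortcuts import render

def document(request, slug):
    template = "wiki/document.html"
    if waffle.flag_is_active(request, "new-compat-tables"):  # hypothetical flag
        template = "wiki/document-new-tables.html"
    return render(request, template, {"slug": slug})
```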

Continued Migration of Browser Compatibility Data

The Browser Compatibility Data project was the most active MDN project in November. 36.6% of the MDN pages (2284 total) have been converted. Here are some highlights:

  1. Imported more CSS data, such as the huge list of allowed values for the list-style-type property (this list uses georgian). This property alone required 7 PRs, starting with PR 576. Daniel D. Beck submitted 32 CSS PRs that were merged in November, and is making good progress on converting CSS data.
  2. Added browser and version validation, a month-long effort in PR 439 from Florian Scholz and Jean-Yves Perrier.
  3. Added a runtime_flag for features that can be enabled at browser startup (PR 615 from Florian Scholz); a simplified example of such an entry appears after this list.
  4. Added the first compatibility data for Samsung Internet for Android (PR 657 from first-time contributor Peter O'Shaughnessy).
  5. Shipped the new compatibility table to beta users. Stephanie Hobson resurrected a design that had been through a few rounds of user testing (PR 4436), and has made further improvements such as augmenting colors with gradients (PR 4511). For more details and to give us feedback, see Beta Testing New Compatibility Tables on Discourse.
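
For readers unfamiliar with the data format, here is a simplified, illustrative sketch of what a compat entry with a runtime_flag looks like, expressed as a Python dict; the values are made up and the real BCD schema has more fields:

```python
# Simplified shape of one feature's compat data -- illustrative, not the
# exact BCD schema. A runtime_flag records that support exists only when the
# browser is started with a particular flag.
feature = {
    "__compat": {
        "support": {
            "samsunginternet_android": {"version_added": "4"},
            "chrome": {
                "version_added": "60",
                "flags": [
                    {
                        "type": "runtime_flag",
                        "name": "--enable-experimental-web-platform-features",
                    }
                ],
            },
        },
    }
}
```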

New Browser Compatibility Table

Sticky Table of Contents and Other Article Improvements

We shipped some additional article improvements in November.

The new table of contents is limited to the top-level headings, and “sticks” to the top of the window at desktop sizes, showing where you are in a document and allowing fast navigation (PR 4510 from Stephanie Hobson).

Sticky Table of Contents

The breadcrumbs (showing where you are in the page hierarchy) have moved to the sidebar, and now have schema.org metadata tags. Stephanie also refreshed the style of the sidebar links.

Breadcrumbs and Quick Links

Stephanie also updated the visual hierarchy of article headings. This is most noticeable on <h3> elements, which are now indented with blank space.

New <h3> style

Improved MDN in AWS and Kubernetes

We continued to have performance and uptime issues in AWS in November. We’re prioritizing fixing these issues, and we’re delaying some 2017 plans, such as improving KumaScript translations and upgrading Django, to next year.

We lost GZip compression in the move to AWS. Ryan Johnson added it back in PR 4522. This reduced the average page download time by 71% (0.57s to 0.16s), and contributed to a 6% decrease in page load time (4.2 to 4.0s).
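
In a Django app, the usual way to get response compression is GZipMiddleware; this is only a sketch of the general approach, and PR 4522 may have restored compression at a different layer (for example in WhiteNoise or at the load balancer):

```python
# Illustrative settings fragment, not Kuma's exact middleware list.
# GZipMiddleware sits near the top of the stack so it compresses the final
# response body before it leaves the application.
MIDDLEWARE = [
    "django.middleware.gzip.GZipMiddleware",
    # ... the rest of the middleware stack ...
]
```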

Page Download drop due to GZip

Heavy load due to scraping caused 6 downtimes totaling 35 minutes. We worked to improve the performance of unpopular pages that get high traffic from scrapers, such as document list views (PR 4463 from John Whitlock) and the revisions dashboard (PR 4520 from Josh Mize). This made the system more resilient.

Kubernetes was contributing to the downtimes, by restarting web servers when they started to undergo heavy load and were slow to respond. We’ve adjusted our “readiness” and “liveness” probes so that Kubernetes will be more patient and more gentle (Infra PR 665 from Ryan Johnson).
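
The distinction matters: a liveness probe should only ask “is the process alive?”, while a readiness probe asks “can this pod serve traffic right now?”. Here’s a hedged sketch of the two kinds of endpoints in Django terms (Kuma’s real health-check views may differ); the infra change was about making the probes that call endpoints like these less aggressive about timeouts and failure thresholds.

```python
# Illustrative health-check views, not Kuma's actual code. A readiness check
# that touches MySQL will fail when the database is slow, which tells
# Kubernetes to stop routing traffic to the pod; an over-aggressive liveness
# probe in the same situation restarts the pod instead, making things worse.
from django.db import connection
from django.http import HttpResponse, HttpResponseServerError

def liveness(request):
    # No external dependencies: only proves the web process is running.
    return HttpResponse("ok")

def readiness(request):
    try:
        with connection.cursor() as cursor:
            cursor.execute("SELECT 1")
    except Exception:
        return HttpResponseServerError("database unavailable")
    return HttpResponse("ok")
```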

These changes have made MDN more resilient and reliable, but more work will be needed in December.

Stephanie Hobson fixed the development favicon appearing in production (PR 4530), as well as an issue with lazy-loading web fonts (PR 4533).

Ryan Johnson continues work on our deployment process. Pushing certain branches will cause Jenkins to take specific deployment steps. Pushing master will run tests and publish a Docker image. Pushing stage-push will deploy that image to stage.mdn.moz.works. Pushing stage-integration-tests will run browser and HTTP tests against that deployment. We’ll make these steps more reliable, add production variants, and then link them together into automated deployment pipelines.

Shipped Tweaks and Fixes

There were 260 PRs merged in November, many of them from external contributors, including several first-time contributions.

Planned for December

Mozilla gathers for the All-Hands event in Austin, TX in December, which gives us a chance to get together, celebrate the year’s accomplishments, and plan for 2018. Mozilla offices will shut down for the last full week of December. This doesn’t leave a lot of time for coding.

We’ll continue working on the projects we worked on in November. We’ll convert more Browser Compatibility data. We’ll tweak the AWS infrastructure. We’ll eliminate and convert more waffle flags. We’ll watch the interactive examples and improved compatibility tables, and ship them when ready.

We’ll also take a step back, and ask if we’re spending time and attention on the most important things. We’ll think about our processes, and how they could better support our priorities.

But mostly, we’ll try not to mess things up, so that we can enjoy the holidays with friends and family, and come back refreshed for 2018.

MozMEAO SRE Status Report - November 14, 2017

Here’s what happened on the MozMEAO SRE team from November 7th - November 14th.

Current work

Firefox Quantum release

The team actively monitored our Bedrock Kubernetes deployments during the release of Firefox Quantum (https://www.mozilla.org/en-US/firefox/). No manual intervention was required during the release.

SRE General

SUMO

  • An Elastic.co Elasticsearch development instance has been provisioned and is usable by the SUMO development team.
  • Redis and RDS provisioning automation has been merged, but resources have not been provisioned in AWS.
  • The team worked on a SUMO infra estimate for AWS.
    • Assumes existing K8s cluster, possible shared RDS/Elasticache.

MDN

Static site hosting

The team is in the process of evaluating several static hosting solutions.

MozMEAO SRE Status Report - November 7, 2017

Here’s what happened on the MozMEAO SRE team from October 31st - November 7th.

Current work

SUMO

Work progresses on a SUMO development environment for use with Kubernetes in AWS.

MDN

Kuma Report, October 2017

Here’s what happened in October in Kuma, the engine of MDN:

  • MDN Migrated to AWS
  • Continued Migration of Browser Compatibility Data
  • Shipped tweaks and fixes

Here’s the plan for November:

  • Ship New Compat Table to Beta Users
  • Improve Performance of MDN and the Interactive Editor
  • Update Localization of KumaScript Macros

I’ve also included an overview of the AWS migration project, and an introduction to our new AWS infrastructure in Kubernetes, which helps make this the longest Kuma Report yet.

Done in October

MDN Migrated to AWS

On October 10, we moved MDN from Mozilla’s SCL3 datacenter to a Kubernetes cluster in the AWS us-west-2 (Oregon) region. The database move went well, but we needed five times the web resources that the maintenance mode tests had suggested. We were able to smoothly scale up in the four hours we budgeted for the migration. Dave Parfitt and Ryan Johnson did a great job implementing a flexible set of deployment tools and monitors that allowed us to quickly react to and handle the unexpected load.

The extra load was caused by mdn.mozillademos.org, which serves user uploads and wiki-based code samples. These untrusted resources are served from a different domain so that browsers will protect MDN users from the worst security issues. I excluded these resources from the production traffic tests, which turned out to be a mistake, since they represent 75% of the web traffic load after the move.

Ryan and I worked to get this domain behind a CDN. This included avoiding a Vary: Cookie header that was being added to all responses (PR 4469), and adding caching headers to each endpoint (PR 4462 and PR 4476).
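
In Django terms, the shape of those changes looks roughly like the following; this is a hedged sketch, not the actual code in those PRs, and the view and parameter names are made up. Vary: Cookie usually appears when a view touches the session, which makes every response per-user and therefore uncacheable by a CDN; the fix is to keep these endpoints session-free and mark them publicly cacheable:

```python
# Illustrative only: a publicly cacheable endpoint on the untrusted domain.
# Because the response carries Cache-Control: public and no Vary: Cookie,
# CloudFront can store one copy and serve it to every user.
from django.http import HttpResponse
from django.views.decorators.cache import cache_control

@cache_control(public=True, max_age=60 * 60 * 24)  # cacheable for a day
def attachment(request, attachment_id):
    # ... look up and stream the file; details omitted ...
    return HttpResponse(b"file bytes", content_type="application/octet-stream")
```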

We added CloudFront to the domain on October 26. Now most of these resources are served from the CloudFront CDN, which is fast and often closer to the MDN user (for example, served to French users from a server in Europe rather than California). Over a week, 197 GB was served from the CDN, versus 3 GB (1.5%) served from Kuma.

Bytes to User

Load on Kuma is reduced as well. The CDN can answer many requests on its own, so Kuma doesn’t see them at all. For the rest, the CDN periodically checks with Kuma that content hasn’t changed, which often needs only a short 304 Not Modified response rather than the full content.
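
The mechanics of those revalidations can be expressed with Django’s conditional-view decorator; this is a hedged sketch rather than Kuma’s actual implementation, and the helper below is hypothetical:

```python
# Illustrative only: when the CDN revalidates with If-Modified-Since, Django
# compares against last_modified_func and can answer 304 Not Modified without
# rendering or sending the full response body.
from datetime import datetime, timezone
from django.http import HttpResponse
from django.views.decorators.http import condition

def sample_last_modified(request, sample_id):
    """Hypothetical helper: when was this code sample last changed?"""
    return datetime(2017, 10, 26, tzinfo=timezone.utc)

@condition(last_modified_func=sample_last_modified)
def code_sample(request, sample_id):
    return HttpResponse("<html>...</html>")
```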

Backend requests for attachments have dropped by 45%:

Attachment Throughput

Code sample requests have dropped by 96%:

Code Sample Throughput

We continue to use a CDN for our static assets, but not for developer.mozilla.org itself. We’d have to do similar work to add caching headers, ideally splitting anonymous content from logged-in content. The untrusted domain had 4 endpoints to consider, while developer.mozilla.org has 35 to 50. We hope to do this work in 2018.

Continued Migration of Browser Compatibility Data

The Browser Compatibility Data project was the most active MDN project in October. Another 700 MDN pages use the BCD data, bringing us up to 2200 MDN pages, or 35.5% of the pages with compatibility data.

Daniel D. Beck continues migrating the CSS data, which will take at least the rest of 2017. wbamberg continues to update WebExtension and API data, which needs to keep up with browser releases. Chris Mills migrated the Web Audio data with 32 PRs, starting with PR 433. This data includes mixin interfaces, and prompted some discussion about how to represent them in BCD in issue #472.

Florian Scholz added MDN URLs in PR 344, which will help BCD integrators to link back to MDN for more detailed information.

Browser names and versions are an important part of the compatibility data, and Florian and Jean-Yves Perrier worked to formalize their representation in BCD. This includes standardization of the first version, preferring “33” to “33.0” (PR 447 and more), and fixing some invalid version numbers (PR 449 and more). In November, BCD will add more of this data, allowing automated validation of version data, and enabling some alternate ways to present compat data.

Florian continues to release a new NPM package each Monday, and enabled tag-based releases (PR 565) for the most recent 0.0.12 release. mdn-browser-compat-data had over 900 downloads last month.

Shipped Tweaks and Fixes

There were 276 PRs merged in October, many of them from external contributors, including several first-time contributions.

Planned for November

Ship New Compat Table to Beta Users

Stephanie Hobson and Florian are collaborating on a new compat table design for MDN, based on the BCD data. The new format summarizes support across desktop and mobile browsers, while still allowing developers to dive into the implementation details. We’ll ship this to beta users on 2200 MDN pages in November. See Beta Testing New Compatibility Tables on Discourse for more details.

New Compat Table

Improve Performance of MDN and the Interactive Editor

Page load times have increased with the move to AWS. We’re looking into ways to increase performance across MDN. You can follow our MDN Post-migration project for more details. We also want to enable the interactive editor for all users, but we’re concerned about further increasing page load times. You can follow the remaining issues in the interactive-examples repo.

Update Localization of KumaScript Macros

In August, we planned the toolkit we’d use to extract strings from KumaScript macros (see bug 1340342). We put implementation on hold until after the AWS migration. In November, we’ll dust off the plans and get some sample macros converted. We’re hopeful the community will make short work of the rest of the macros.

MDN in AWS

The AWS migration project started in November 2014, bug 1110799. The original plan was to switch by summer 2015, but the technical and organizational hurdles proved harder than expected. At the same time, the team removed many of the legacy barriers that made Kuma hard to migrate. A highlight of the effort was the Mozilla All Hands in December 2015, where the team merged several branches of work-in-progress code to get Kuma running in Heroku. Thanks to Jannis Leidel, Rob Hudson, Luke Crouch, Lonnen, Will Kahn-Greene, David Walsh, James Bennett, cyliang, Jake, Sean Rich, Travis Blow, Sheeri Cabral, and everyone else who worked on or influenced this first phase of the project.

The migration project rebooted in Summer 2016. We switched to targeting Mozilla Marketing’s deployment environment. I split the work into smaller steps leading up to AWS. I thought each step would take about a month. They took about 3 months each. Estimating is hard.

2016 MDN Tech Plan

Changes to MDN Services

MDN no longer uses Apache to serve files and proxy Kuma. Instead, Kuma serves requests directly with gunicorn, using the meinheld worker. I did some analysis in January, and Dave Parfitt and Ryan Johnson led the effort to port Apache features to Kuma:

  • Redirects are implemented with Paul McLanahan’s django-redirect-urls.
  • Static assets (CSS, JavaScript, etc.) are served directly with WhiteNoise (see the settings sketch after this list).
  • Kuma handles the domain-based differences between the main website and the untrusted domain.
  • Miscellaneous files like robots.txt, sitemaps, and legacy files (from the early days of MDN) are served directly.
  • Kuma adds security headers to responses.
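
As a hedged illustration of what the WhiteNoise and security-header pieces look like in Django settings (not Kuma’s exact configuration):

```python
# Illustrative settings fragment, not the real Kuma settings. WhiteNoise
# serves the collected static files from the application process, and
# SecurityMiddleware adds headers that Apache used to be responsible for.
MIDDLEWARE = [
    "django.middleware.security.SecurityMiddleware",
    "whitenoise.middleware.WhiteNoiseMiddleware",
    # ... the rest of the middleware stack ...
]

STATICFILES_STORAGE = "whitenoise.storage.CompressedManifestStaticFilesStorage"

# Example security headers; the actual values are Kuma's to choose.
SECURE_CONTENT_TYPE_NOSNIFF = True
SECURE_BROWSER_XSS_FILTER = True
```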

Another big change is how the services are run. The base unit of implementation in SCL3 was multi-purpose virtual machines (VMs). In AWS, we are switching to application-specific Docker containers.

In SCL3, the VMs were split into 6 user-facing web servers and 4 backend Celery servers. In AWS, the EC2 servers act as Docker hosts. Docker uses operating system virtualization, which has several advantages over machine virtualization for our use cases. The Docker images are distributed over the EC2 servers, as chosen by Kubernetes.

SCL3 versus AWS Servers

The SCL3 servers were maintained as long-running servers, using Puppet to install security updates and new software. The servers were multi-purpose, used for Kuma, KumaScript, and backend Celery processes. With Docker, we instead use a Python/Kuma image and a node.js/KumaScript image to implement MDN.

SCL3 versus AWS MDN units

The Python/Kuma image is configurable through environment variables to run in different domains (such as staging or production), and to be configured as one of our three main Python services:

  • web - User-facing Kuma service
  • celery - Backend Kuma processes outside of the request loop
  • api - A backend Kuma service, used by KumaScript to render pages. This avoids an issue in SCL3 where KumaScript API calls were competing with MDN user requests.

Our node.js/KumaScript service is also configured via environment variables, and implements the fourth main service of MDN:

  • kumascript - The node.js service that renders wiki pages
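
Here is a minimal sketch of the pattern described above: one image, with an environment variable deciding which role the container plays. The variable name and commands are hypothetical; the real images use their own entry points.

```python
# Illustrative container entry point, not the actual Kuma image. The same
# Docker image starts gunicorn for the user-facing web and backend api roles,
# or a Celery worker, depending on an environment variable.
import os
import subprocess

ROLE_COMMANDS = {
    "web": ["gunicorn", "kuma.wsgi:application", "--bind", "0.0.0.0:8000"],
    "api": ["gunicorn", "kuma.wsgi:application", "--bind", "0.0.0.0:8001"],
    "celery": ["celery", "-A", "kuma", "worker"],
}

if __name__ == "__main__":
    role = os.environ.get("KUMA_ROLE", "web")  # hypothetical variable name
    subprocess.run(ROLE_COMMANDS[role], check=True)
```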

Building the Docker images involves installing system software, installing the latest code, creating the static files, compiling translations, and preparing other run-time assets. AWS deployments are the relatively fast process of switching to newer Docker images. This is an improvement over SCL3, which required doing most of the work during deployment while developers watched.

An Introduction to Kubernetes

Kubernetes is a system for automating the deployment, scaling, and management of containerized applications. Kubernetes’s view of MDN looks like this:

AWS MDN from Kubernetes' Perspective

A big part of understanding Kubernetes is learning the vocabulary. Kubernetes Concepts is a good place to start. Here’s how some of these concepts are implemented for MDN:

  • Ten EC2 instances in AWS are configured as Nodes, and joined into a Kubernetes Cluster. Our “Portland Cluster” is in the us-west-2 (Oregon) AWS region. Nine Nodes are available for application usage, and the master Node runs the Cluster.
  • The mdn-prod Namespace collects the resources that need to collaborate to make MDN work. The mdn-stage Namespace is also in the Portland Cluster, along with other Mozilla projects.
  • A Service defines a service provided by an application at a TCP port. For example, a webserver provides an HTTP service on port 80.
    • The web service is connected to the outside world via an AWS Elastic Load Balancer (ELB), which can reach it at https://developer.mozilla.org (the main site) and https://mdn.mozillademos.org (the untrusted resources).
    • The api and kumascript services are available inside the cluster, but not routed to the outside world.
    • celery doesn’t accept HTTP requests, and so it doesn’t get a Service.
  • The application that provides a service is defined by a Deployment, which declares what Docker image and tag will be used, how many replicas are desired, the CPU and memory budget, what disk volumes should be mounted, and what the environment configuration should be.
  • A Kubernetes Deployment is a higher-level object, implemented with a ReplicaSet, which then starts up several Pods to meet the demands. ReplicaSets are named after the Service plus a random number, such as web-61720, and the Pods are named after the ReplicaSets plus a random string, like web-61720-s7l.

ReplicaSets and Pods come into play when new software is rolled out. The Deployment creates a new ReplicaSet for the desired state, and creates new Pods to implement it, while it destroys the Pods in the old ReplicaSet. This rolling deployment ensures that the application is fully available while new code and configurations are deployed. If something is wrong with the new code that makes the application crash immediately, the deployment is cancelled. If it goes well, the old ReplicaSet is kept around, making it easier to roll back if subtler bugs turn up.

Kubernetes Rolling Deployment

This deployment style puts the burden on the developer to ensure that the two versions can run at the same time. Caution is needed around database changes and some interface changes. In exchange, deployments are smooth and safe with no downtime. Most of the setup work is done when the Docker images are created, so deployments take about a minute from start to finish.

Kubernetes takes control of deploying the application and ensures it keeps running. It allocates Pods to Nodes (called Scheduling), based on the CPU and memory budget for the Pod, and the existing load on each Node. If a Pod terminates, due to an error or other cause, it will be restarted or recreated. If a Node fails, replacement Pods will be created on surviving Nodes.

The Kubernetes system allows several ways to scale the application. We used some for handling the unexpected load of the user attachments:

  • We went from 10 to 11 Nodes, to increase the total capacity of the Cluster.
  • We scaled the web Deployment from 6 to 20 Pods, to handle more simultaneous connections, including the slow file requests (a small scaling sketch follows this list).
  • We scaled the celery Deployment from 6 to 10 Pods, to handle the load of populating the cold cache.
  • We adjusted the gunicorn worker threads from 4 to 8, to increase the simultaneous connections.
  • We rolled out new code to improve caching.
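
The Deployment scale-ups are ordinary Kubernetes operations, normally done with kubectl scale or by editing the Deployment; here is an equivalent hedged sketch using the Kubernetes Python client. The object names come from the text above, but the exact resources in our infra repo may differ.

```python
# Illustrative only: bump the replica counts on the web and celery
# Deployments in the mdn-prod Namespace.
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

def scale(deployment: str, replicas: int, namespace: str = "mdn-prod") -> None:
    apps.patch_namespaced_deployment_scale(
        name=deployment,
        namespace=namespace,
        body={"spec": {"replicas": replicas}},
    )

scale("web", 20)     # 6 -> 20 Pods for more simultaneous connections
scale("celery", 10)  # 6 -> 10 Pods to repopulate the cold cache
```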

There are many more details, which you can explore by reading our configuration files in the infra repo. We use Jinja for our templates, which we find more readable than the Go templates used by many Kubernetes projects. We’ll continue to refine these as we adjust and improve our infrastructure. You can see our current tasks by following the MDN Post-migration project.