Here’s what happened on the MozMEAO SRE team from November 14th - December 5th.
Current work
SUMO
Work continues on the SUMO move to AWS. We’ve provisioned a small RDS MySQL instance in AWS for development and tried importing a production snapshot. The import took 30 hours on a db.t2.small instance, so we experimented with temporarily scaling the RDS instance to a db.m4.xlarge. The import is now expected to complete in 5 hours.
We will investigate if incremental backup/restore is an option for the production transition.
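The temporary resize itself is a single RDS API call. Purely as an illustration (the instance identifier below is made up, not our actual instance), scaling up with boto3 might look like this:

```python
import boto3

rds = boto3.client("rds", region_name="us-west-2")

# Temporarily bump the development instance to a larger class for the import;
# running the same call with "db.t2.small" afterwards scales it back down.
rds.modify_db_instance(
    DBInstanceIdentifier="sumo-dev",   # hypothetical identifier
    DBInstanceClass="db.m4.xlarge",
    ApplyImmediately=True,
)
```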
MDN
MDN had several short downtime events in November, caused by heavy load due to scraping. Our K8s liveness and readiness probes often forced pods to restart when MySQL was slow to respond.
Several readiness and liveness probe changes were issued by @escattone and @jwhitlock to help alleviate the issue.
We want the examples to be as fast as possible.
Schalk Neethling improved
the page load speed of the <iframe> by using preload
URLs
(PR 4537).
Stephanie Hobson and Schalk dived into
HTTP/2, and identified require.js as a potential issue for this
protocol (Kuma PR 4521 and
Interactive Examples PR 329).
Josh Mize added appropriate
caching headers for the examples and static assets
(PR 326).
For the next level of speed gains, we’ll need to speed up the MDN pages
themselves. One possibility is to serve developer.mozilla.org from a CDN,
which will require big changes to make pages more cacheable. One issue is
waffle flags,
which allow us to experiment with per-user changes, at the cost of making pages
uncacheable. Schalk has made steady progress in eliminating inactive waffle
flag experiments, and this work will continue into December.
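To see why the flags defeat caching, consider how a flag is typically checked in a Django view with django-waffle (the flag name and markup below are illustrative, not actual Kuma code):

```python
from django.http import HttpResponse
from waffle import flag_is_active  # django-waffle


def compat_section(request):
    # The flag can evaluate differently per user (staff, percentage rollout,
    # cookie override), so the response varies per request and a shared cache
    # such as a CDN cannot safely store one copy for everyone.
    if flag_is_active(request, "new-compat-tables"):  # illustrative flag name
        return HttpResponse("<section>new compat tables</section>")
    return HttpResponse("<section>legacy compat tables</section>")
```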
The Browser Compatibility Data
project was the most active MDN project in November. 36.6% of the MDN pages
(2284 total) have been converted. Here are some highlights:
Imported more CSS data, such as the huge list of allowed values for the
list-style-type
property (this list uses georgian).
This property alone required 7 PRs, starting with
PR 576.
Daniel D. Beck submitted 32 CSS
PRs that were merged in November, and is making good progress on
converting CSS data.
Added a runtime_flag for features that can be enabled at
browser startup
(PR 615
from
Florian Scholz).
Added the first compatibility data for Samsung Internet for Android
(PR 657
from first-time contributor
Peter O'Shaughnessy).
Shipped the new compatibility table to beta users.
Stephanie Hobson
resurrected a design that had been through a few rounds of user testing
(PR 4436),
and has made further improvements such as augmenting colors with
gradients
(PR 4511).
For more details and to give us feedback, see
Beta Testing New Compatibility Tables
on Discourse.
We shipped some additional article improvements in November.
The new table of contents is limited to the top-level headings, and “sticks” to
the top of the window at desktop sizes, showing where you are in a document and
allowing fast navigation (PR 4510
from Stephanie Hobson).
The breadcrumbs (showing where you are in the page hierarchy) have moved to the
sidebar, and now have schema.org metadata tags.
Stephanie also refreshed the style of the sidebar links.
Stephanie also updated the visual hierarchy of article headings. This is most
noticeable on <h3> elements, which are now indented with
white space.
We continued to have performance and uptime issues in AWS in November. We’re
prioritizing fixing these issues, and we’re delaying some 2017 plans, such as
improving KumaScript translations and upgrading Django, until next year.
We lost GZip compression in the move to AWS.
Ryan Johnson added it back in
PR 4522. This reduced the
average page download time by 71% (0.57s to 0.16s), and contributed to a
6% decrease in page load time (4.2 to 4.0s).
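PR 4522 isn’t reproduced here; one common way to restore compression in a Django application is to enable the stock GZipMiddleware near the top of the middleware stack, roughly:

```python
# settings.py (sketch): GZipMiddleware compresses response bodies for clients
# that send Accept-Encoding: gzip. Listing it first lets it wrap the output of
# everything below it.
MIDDLEWARE = [
    "django.middleware.gzip.GZipMiddleware",
    "django.middleware.security.SecurityMiddleware",
    "django.middleware.common.CommonMiddleware",
    # ... the rest of the project's middleware ...
]
```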
Heavy load due to scraping caused 6 downtimes totaling 35 minutes.
We worked to improve the performance of unpopular pages that get high traffic
from scrapers, such as document list views
(PR 4463 from
John Whitlock) and the revisions dashboard
(PR 4520 from
Josh Mize). This made the system more resilient.
Kubernetes was contributing to the downtimes by restarting web servers when
they came under heavy load and were slow to respond. We’ve adjusted
our “readiness” and “liveness” probes so that Kubernetes will be more patient
and more gentle
(Infra PR 665 from
Ryan Johnson).
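The probe timing changes themselves live in the infra repo. Conceptually, a liveness check should only confirm that the process can answer at all, while a readiness check may verify dependencies such as MySQL. A hypothetical sketch of the two Django views (endpoint names and checks are illustrative, not Kuma’s actual code):

```python
from django.db import connection
from django.http import HttpResponse, JsonResponse


def liveness(request):
    # Liveness: only asserts that the worker can serve a request at all.
    # If this fails repeatedly, Kubernetes restarts the pod.
    return HttpResponse("OK")


def readiness(request):
    # Readiness: checks a dependency (here, MySQL). If this fails, Kubernetes
    # stops routing traffic to the pod rather than restarting it, so a slow
    # database no longer forces a cascade of restarts.
    try:
        with connection.cursor() as cursor:
            cursor.execute("SELECT 1")
    except Exception:
        return JsonResponse({"database": "unavailable"}, status=503)
    return JsonResponse({"database": "ok"})
```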
These changes have made MDN more resilient and reliable, but more work will be
needed in December.
Stephanie Hobson fixed the development
favicon appearing in production
(PR 4530), as well as an issue
with lazy-loading web fonts
(PR 4533).
Ryan Johnson continues work on our deployment
process. Pushing certain branches will cause Jenkins to take specific
deployment steps. Pushing master will run tests and publish a Docker image.
Pushing stage-push will deploy that image to
stage.mdn.moz.works. Pushing
stage-integration-tests will run browser and HTTP tests against that
deployment. We’ll make these steps more reliable, add production variants, and
then link them together into automated deployment pipelines.
Many of the BCD pull requests merged in November came from external
contributors, including several first-time contributions. Here are some of the
highlights:
Update chrome_url_overrides for Opera
(BCD PR 559),
from first-time contributor
Zbyněk Eiselt.
Mark add and set methods of Set, Map and WeakMap objects as
partially implemented on IE 11
(BCD PR 586),
from first-time contributor
Ivan Buryak.
Add link to Edge bug report for <a>
(BCD PR 592),
from first-time contributor
Michael Hogg.
Update status for CSS Scroll Snapping properties
(BCD PR 609),
from first-time contributor
Masataka Yakura.
Update IANA timezone name support in Chrome/Opera
(BCD PR 611),
from first-time contributor
jungshik.
Update browser identifier declaration instructions
(BCD PR 625),
Fix schema documentation for flag
(BCD PR 627), and
Add example for Status information
(BCD PR 628),
from first-time contributor
Ra’Shaun Stovall.
Update support data for parseInt treatment of leading zeros
(BCD PR 633),
from first-time contributor
Claude Pache.
Update textarea@autocomplete compat data
(BCD PR 637 and
PR 673),
from first-time contributor
Matt N.
Add sampleRate option to new AudioContext()
(BCD PR 651),
from first-time contributor
Jedipedia.
Add support for page_action for FF for Android
(BCD PR 667),
from first-time contributor
Elad.
Safari has implemented upgrade-insecure-requests
(BCD PR 670),
from first-time contributor
Justyn Temme.
Mozilla gathers for the
All-Hands event in Austin, TX in
December, which gives us a chance to get together, celebrate the year’s
accomplishments, and plan for 2018. Mozilla offices will shut down for the
last full week of December. This doesn’t leave a lot of time for coding.
We’ll continue working on the projects we worked on in November. We’ll convert
more Browser Compatibility data. We’ll tweak the AWS infrastructure. We’ll
eliminate and convert more waffle flags. We’ll watch the interactive examples
and improved compatibility tables, and ship them when ready.
We’ll also take a step back, and ask if we’re spending time and attention on
the most important things. We’ll think about our processes, and how they could
better support our priorities.
But mostly, we’ll try not to mess things up, so that we can enjoy the holidays
with friends and family, and come back refreshed for 2018.
Here’s what happened on the MozMEAO SRE team from November 7th - November 14th.
Current work
Firefox Quantum release
The team actively monitored our bedrock Kubernetes deployments during the release of Firefox Quantum (https://www.mozilla.org/en-US/firefox/). No manual intervention was required during the release.
SRE General
To step up our efforts on the security front, we’ve updated all of our application Docker images to build from a small set of recommended base images.
SUMO
An Elastic.co Elasticsearch development instance has been provisioned and is usable by the SUMO development team.
MDN
We’re now varying the CloudFront cache on the querystring parameter revision, which is used to refresh embedded live samples when the document is updated.
On October 10, we moved MDN from Mozilla’s SCL3 datacenter to a
Kubernetes cluster in
the AWS us-west-2 (Oregon) region. The database move went well, but we
needed five times the web resources that our maintenance-mode tests had
suggested. We were able to scale up smoothly within the four hours we had
budgeted for the migration.
Dave Parfitt and
Ryan Johnson did a great job implementing
a flexible set of deployment tools and monitors that allowed us to quickly
react to and handle the unexpected load.
The extra load was caused by mdn.mozillademos.org, which serves
user uploads and
wiki-based code samples.
These untrusted resources are served from a different domain so that browsers
will protect MDN users from the worst security issues. I excluded these
resources from the production traffic tests, which turned out to be a mistake,
since they represented 75% of the web traffic load after the move.
Ryan and I worked to get this domain behind a CDN. This included avoiding a
Vary: Cookie header that was being added to all responses
(PR 4469), and adding
caching headers to each endpoint
(PR 4462 and
PR 4476).
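Those PRs aren’t shown here, but the general shape in Django is to mark each endpoint’s responses as publicly cacheable and to avoid per-user Vary headers; an illustrative sketch:

```python
from django.http import HttpResponse
from django.views.decorators.cache import cache_control


@cache_control(public=True, max_age=60 * 60 * 24)  # illustrative: one day
def code_sample(request, sample_id):
    # Explicit public caching headers let CloudFront keep a copy close to the
    # user instead of asking Kuma every time. Adding `Vary: Cookie` here would
    # make every visitor's copy distinct and defeat the CDN.
    return HttpResponse("<p>rendered sample</p>", content_type="text/html")
```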
We added CloudFront
to the domain on October 26. Now most of these resources are served from
the CloudFront CDN, which is fast and often closer to the MDN user (for
example, served to French users from a server in Europe rather than
California). Over a week, 197 GB was served from the CDN, versus 3 GB (1.5%)
served from Kuma.
Load on Kuma is reduced as well. The CDN handles many requests on its own, so
Kuma never sees them. For the rest, the CDN periodically checks with Kuma that
content hasn’t changed, which often requires only a short 304 Not Modified
response rather than the full content.
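That revalidation uses ordinary conditional requests. In Django the condition decorator handles them: it computes a validator such as Last-Modified and answers with a bare 304 Not Modified when it still matches. A sketch with an illustrative in-memory store standing in for the real models:

```python
import datetime

from django.http import HttpResponse
from django.views.decorators.http import condition

# Illustrative stand-in for the real storage layer.
SAMPLES = {"demo": ("<p>example</p>", datetime.datetime(2017, 11, 1))}


def sample_last_modified(request, sample_id):
    return SAMPLES[sample_id][1]


@condition(last_modified_func=sample_last_modified)
def live_sample(request, sample_id):
    # The body is built only when the CDN's copy is stale; if Last-Modified
    # still matches, Django replies with a short 304 Not Modified instead.
    return HttpResponse(SAMPLES[sample_id][0])
```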
Backend requests for attachments have dropped by 45%:
Code samples requests have dropped by 96%:
We continue to use a CDN for our static assets, but not for
developer.mozilla.org itself. We’d have to do similar work to add caching
headers, ideally splitting anonymous content from logged-in content. The
untrusted domain had 4 endpoints to consider, while developer.mozilla.org
has 35 to 50. We hope to do this work in 2018.
Continued Migration of Browser Compatibility Data
The Browser Compatibility Data
project was the most active MDN project in October. Another 700 MDN pages use
the BCD data, bringing us up to 2200 MDN pages, or 35.5% of the pages with
compatibility data.
Daniel D. Beck continues migrating the CSS data,
which will take at least the rest of 2017.
wbamberg continues to update WebExtension and
API data, which needs to keep up with browser releases.
Chris Mills migrated the Web Audio data
with 32 PRs, starting with
PR 433. This data
includes mixin interfaces, and prompted some discussion about how to
represent them in BCD in
issue #472.
Florian Scholz added MDN URLs in
PR 344, which will help
BCD integrators to link back to MDN for more detailed information.
Browser names and versions are an important part of the compatibility data, and
Florian and Jean-Yves Perrier worked to
formalize their representation in BCD. This includes standardization of
the first version, preferring “33” to “33.0”
(PR 447 and more),
and fixing some invalid version numbers
(PR 449 and more).
In November, BCD will add more of this data, allowing automated validation
of version data, and enabling some alternate ways to present compat data.
Florian continues to release a new NPM package each Monday, and
enabled tag-based releases
(PR 565)
for the most recent 0.0.12 release.
mdn-browser-compat-data
had over 900 downloads last month.
Stephanie Hobson and Florian are
collaborating on a
new compat table design
for MDN, based on the BCD data.
The new format summarizes support across desktop and mobile browsers, while
still allowing developers to dive into the implementation details. We’ll ship
this to beta users on 2200 MDN pages in November. See
Beta Testing New Compatibility Tables
on Discourse for more details.
Improve Performance of MDN and the Interactive Editor
Page load times have increased with the move to AWS. We’re looking into ways
to increase performance across MDN. You can follow our
MDN Post-migration project
for more details.
We also want to enable the
interactive editor
for all users, but we’re concerned about further increasing page load times. You can
follow the remaining issues in the
interactive-examples repo.
Update Localization of KumaScript Macros
In August, we planned the toolkit we’d use to extract strings from
KumaScript macros (see
bug 1340342).
We put implementation on hold until after the AWS migration. In November,
we’ll dust off the plans and get some sample macros converted. We’re hopeful
the community will make short work of the rest of the macros.
MDN in AWS
The AWS migration project started in
November 2014,
bug 1110799.
The original plan was to switch by summer 2015, but the technical and
organizational hurdles proved harder than expected. At the same time, the team
removed many legacy barriers making Kuma hard to migrate. A highlight of the
effort was the Mozilla All Hands in December 2015, where the team merged
several branches of work-in-progress code
to get Kuma running in Heroku.
Thanks to
Jannis Leidel,
Rob Hudson,
Luke Crouch,
Lonnen,
Will Kahn-Greene,
David Walsh,
James Bennet,
cyliang,
Jake,
Sean Rich,
Travis Blow,
Sheeri Cabral,
and everyone else who worked on or influenced this first phase of the project.
The migration project
rebooted in Summer 2016.
We switched to targeting Mozilla Marketing’s deployment environment. I split
the work into smaller steps leading up to AWS. I thought each step would take
about a month. They took about 3 months each. Estimating is hard.
Changes to MDN Services
MDN no longer uses Apache to serve files and proxy Kuma.
Instead, Kuma serves requests directly with
gunicorn with the
meinheld worker. I did
some analysis
in January, and
Dave Parfitt and
Ryan Johnson led the effort to port Apache
features to Kuma:
Static assets (CSS, JavaScript, etc.) are served directly with
WhiteNoise.
Kuma handles the domain-based differences between the main website
and the untrusted domain.
Miscellaneous files like robots.txt, sitemaps, and legacy files (from the
early days of MDN) are served directly.
Kuma adds security headers to responses.
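The real settings live in Kuma and its Docker image; a minimal, illustrative gunicorn configuration for this kind of setup might look like the following (the file name and values are assumptions, not Kuma’s actual configuration):

```python
# gunicorn.conf.py (sketch)
bind = "0.0.0.0:8000"                               # port the Service targets
workers = 4                                         # per-pod worker processes
worker_class = "meinheld.gmeinheld.MeinheldWorker"  # meinheld's gunicorn worker
```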
Another big change is how the services are run. The base unit of implementation
in SCL3 was the multi-purpose virtual machine (VM). In AWS, we are switching to
application-specific Docker
containers.
In SCL3, the VMs were split into 6 user-facing web servers and 4 backend
Celery servers.
In AWS, the EC2 servers act as Docker hosts. Docker uses
operating system virtualization,
which has several advantages over machine virtualization for our use cases.
The Docker images are distributed over the EC2 servers, as chosen by Kubernetes.
The SCL3 servers were maintained as long-running servers, using
Puppet to install
security updates and new software. The servers were multi-purpose, used
for Kuma, KumaScript, and backend Celery processes. With Docker, we
instead use a Python/Kuma image and a
node.js/KumaScript image to
implement MDN.
The Python/Kuma image is configurable through
environment variables to run in different
domains (such as staging or production), and to be configured as one of our
three main Python services:
web - User-facing Kuma service
celery - Backend Kuma processes outside of the request loop
api - A backend Kuma service, used by KumaScript to render pages. This
avoids an issue in SCL3 where KumaScript API calls were competing with MDN
user requests.
Our node.js/KumaScript service is also configured via environment variables,
and implements the fourth main service of MDN:
kumascript - The node.js
service that renders wiki pages
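A hypothetical sketch of the pattern, where one image picks its role from an environment variable at startup (the variable names, commands, and module paths are illustrative, not the actual Kuma entrypoint):

```python
import os

# One Docker image, several roles: the entrypoint reads an environment
# variable and execs the matching command.
ROLE = os.environ.get("SERVICE_ROLE", "web")

COMMANDS = {
    "web": ["gunicorn", "kuma.wsgi:application"],   # user-facing service
    "api": ["gunicorn", "kuma.wsgi:application"],   # backend for KumaScript
    "celery": ["celery", "-A", "kuma", "worker"],   # background task runner
}

if __name__ == "__main__":
    command = COMMANDS[ROLE]
    os.execvp(command[0], command)  # replace this process with the service
```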
Building the Docker images involves installing system software, installing the
latest code, creating the static files, compiling translations, and preparing
other run-time assets. AWS deployments are the relatively fast process of
switching to newer Docker images. This is an improvement over SCL3, which
required doing most of the work during deployment while developers watched.
An Introduction to Kubernetes
Kubernetes
is a system for automating the deployment, scaling, and management of
containerized applications.
A big part of understanding Kubernetes is learning the vocabulary.
Kubernetes Concepts is a good place
to start. Here’s how some of these concepts are implemented for MDN:
Ten EC2 instances in AWS are configured as
Nodes, and
joined into a Kubernetes Cluster. Our “Portland Cluster” is in the
us-west-2 (Oregon) AWS region. Nine Nodes are available for application
usage, and the master Node runs the Cluster.
The mdn-prod Namespace
collects the resources that need to collaborate to make MDN work. The
mdn-stage Namespace is also in the Portland Cluster, alongside other
Mozilla projects.
A Service
defines a service provided by an application at a TCP port. For example,
a webserver provides an HTTP service on port 80.
The web service is connected to the outside world via an AWS
Elastic Load Balancer (ELB), which can reach it at
https://developer.mozilla.org (the main site) and
https://mdn.mozillademos.org (the untrusted resources).
The api and kumascript services are available inside the cluster,
but not routed to the outside world.
celery doesn’t accept HTTP requests, and so it doesn’t get a Service.
The application that provides a service is defined by a
Deployment,
which declares what Docker image and tag will be used, how many replicas are
desired, the CPU and memory budget, what disk volumes should be mounted, and
what the environment configuration should be.
A Kubernetes Deployment is a higher-level object, implemented with a
ReplicaSet,
which then starts up several
Pods
to meet the demands. ReplicaSets are named after the Service plus a random
number, such as web-61720, and the Pods are named after the ReplicaSets
plus a random string, like web-61720-s7l.
ReplicaSets and Pods come into play when new software is rolled out. The
Deployment creates a new ReplicaSet for the desired state, and creates new Pods
to implement it, while it destroys the Pods in the old ReplicaSet. This
rolling deployment ensures that the application is fully available while new
code and configurations are deployed. If something is wrong with the new code
that makes the application crash immediately, the deployment is cancelled. If
it goes well, the old ReplicaSet is kept around, making it easier to roll back
if subtler bugs turn up.
This deployment style puts the burden on the developer to ensure that the two
versions can run at the same time. Caution is needed around database changes
and some interface changes. In exchange, deployments are smooth and safe with no
downtime. Most of the setup work is done when the Docker images are created,
so deployments take about a minute from start to finish.
Kubernetes takes control of deploying the application and ensures it keeps
running. It allocates Pods to Nodes (called Scheduling), based on the CPU
and memory budget for the Pod, and the existing load on each Node. If a Pod
terminates, due to an error or other cause, it will be restarted or recreated.
If a Node fails, replacement Pods will be created on surviving Nodes.
The Kubernetes system allows several ways to scale the application. We used
some for handling the unexpected load of the user attachments:
We went from 10 to 11 Nodes, to increase the total capacity of the Cluster.
We scaled the web Deployment from 6 to 20 Pods, to handle more
simultaneous connections, including the slow file requests.
We scaled the celery Deployment from 6 to 10 Pods, to handle the load of
populating the cold cache.
We adjusted the gunicorn worker threads from 4 to 8, to increase
the number of simultaneous connections.
We rolled out new code to improve caching.
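We used standard Kubernetes tooling for these adjustments. Purely as an illustration (not the commands we actually ran), the web Deployment could be resized with the Kubernetes Python client like this:

```python
from kubernetes import client, config

# Illustrative sketch: resize the "web" Deployment in the mdn-prod Namespace
# to 20 replicas; Kubernetes then schedules the new Pods across the Nodes.
config.load_kube_config()  # use the operator's local kubeconfig
apps = client.AppsV1Api()
apps.patch_namespaced_deployment_scale(
    name="web",
    namespace="mdn-prod",
    body={"spec": {"replicas": 20}},
)
```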
There are many more details, which you can explore by reading our
configuration files
in the infra repo. We use
Jinja for our templates, which we find more
readable than the Go templates
used by many Kubernetes projects. We’ll continue to refine these as we
adjust and improve our infrastructure. You can see our current tasks by
following the
MDN Post-migration project.