Here’s what happened on the MozMEAO SRE team from November 14th - December 5th.
Current work
SUMO
Work continues on the SUMO move to AWS. We’ve provisioned a small RDS MySQL instance in AWS for development and tried importing a production snapshot. The import took 30 hours on a db.t2.small instance, so we experimented with temporarily scaling the RDS instance to a db.m4.xlarge. The import is now expected to complete in 5 hours.
We will investigate if incremental backup/restore is an option for the production transition.
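The temporary resize itself is a single RDS API call. Purely as an illustration (the instance identifier below is made up, not our actual instance), scaling up with boto3 might look like this:

```python
import boto3

rds = boto3.client("rds", region_name="us-west-2")

# Temporarily bump the development instance to a larger class for the import;
# running the same call with "db.t2.small" afterwards scales it back down.
rds.modify_db_instance(
    DBInstanceIdentifier="sumo-dev",   # hypothetical identifier
    DBInstanceClass="db.m4.xlarge",
    ApplyImmediately=True,
)
```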
MDN
MDN had several short downtime events in November, caused by heavy load due to scraping. Our K8s liveness and readiness probes often forced pods to restart when MySQL was slow to respond.
Several readiness and liveness probe changes were issued by @escattone and @jwhitlock to help alleviate the issue.
We want the examples to be as fast as possible.
Schalk Neethling improved
the page load speed of the <iframe> by using preload
URLs
(PR 4537).
Stephanie Hobson and Schalk dived into
HTTP/2, and identified require.js as a potential issue for this
protocol (Kuma PR 4521 and
Interactive Examples PR 329).
Josh Mize added appropriate
caching headers for the examples and static assets
(PR 326).
For the next level of speed gains, we’ll need to speed up the MDN pages
themselves. One possibility is to serve developer.mozilla.org from a CDN,
which will require big changes to make pages more cacheable. One issue is
waffle flags,
which allow us to experiment with per-user changes, at the cost of making pages
uncacheable. Schalk has made steady progress in eliminating inactive waffle
flag experiments, and this work will continue into December.
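To see why the flags defeat caching, consider how a flag is typically checked in a Django view with django-waffle (the flag name and markup below are illustrative, not actual Kuma code):

```python
from django.http import HttpResponse
from waffle import flag_is_active  # django-waffle


def compat_section(request):
    # The flag can evaluate differently per user (staff, percentage rollout,
    # cookie override), so the response varies per request and a shared cache
    # such as a CDN cannot safely store one copy for everyone.
    if flag_is_active(request, "new-compat-tables"):  # illustrative flag name
        return HttpResponse("<section>new compat tables</section>")
    return HttpResponse("<section>legacy compat tables</section>")
```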
The Browser Compatibility Data
project was the most active MDN project in November. 36.6% of the MDN pages
(2284 total) have been converted. Here are some highlights:
Imported more CSS data, such as the huge list of allowed values for the
list-style-type
property (this list uses georgian).
This property alone required 7 PRs, starting with
PR 576.
Daniel D. Beck submitted 32 CSS
PRs that were merged in November, and is making good progress on
converting CSS data.
Added a runtime_flag for features that can be enabled at
browser startup
(PR 615
from
Florian Scholz).
Added the first compatibility data for Samsung Internet for Android
(PR 657
from first-time contributor
Peter O'Shaughnessy).
Shipped the new compatibility table to beta users.
Stephanie Hobson
resurrected a design that had been through a few rounds of user testing
(PR 4436),
and has made further improvements such as augmenting colors with
gradients
(PR 4511).
For more details and to give us feedback, see
Beta Testing New Compatibility Tables
on Discourse.
We shipped some additional article improvements in November.
The new table of contents is limited to the top-level headings, and “sticks” to
the top of the window at desktop sizes, showing where you are in a document and
allowing fast navigation (PR 4510
from Stephanie Hobson).
The breadcrumbs (showing where you are in the page hierarchy) have moved to the
sidebar, and now have schema.org metadata tags.
Stephanie also refreshed the style of the sidebar links.
Stephanie also updated the visual hierarchy of article headings. This is most
noticeable on <h3> elements, which are now indented with
white space.
We continued to have performance and uptime issues in AWS in November. We’re
prioritizing fixing these issues, and we’re delaying some 2017 plans, such as
improving KumaScript translations and upgrading Django, until next year.
We lost GZip compression in the move to AWS.
Ryan Johnson added it back in
PR 4522. This reduced the
average page download time by 71% (0.57s to 0.16s), and contributed to a
6% decrease in page load time (4.2 to 4.0s).
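PR 4522 isn’t reproduced here; one common way to restore compression in a Django application is to enable the stock GZipMiddleware near the top of the middleware stack, roughly:

```python
# settings.py (sketch): GZipMiddleware compresses response bodies for clients
# that send Accept-Encoding: gzip. Listing it first lets it wrap the output of
# everything below it.
MIDDLEWARE = [
    "django.middleware.gzip.GZipMiddleware",
    "django.middleware.security.SecurityMiddleware",
    "django.middleware.common.CommonMiddleware",
    # ... the rest of the project's middleware ...
]
```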
Heavy load due to scraping caused 6 downtimes totaling 35 minutes.
We worked to improve the performance of unpopular pages that get high traffic
from scrapers, such as document list views
(PR 4463 from
John Whitlock) and the revisions dashboard
(PR 4520 from
Josh Mize). This made the system more resilient.
Kubernetes was contributing to the downtimes by restarting web servers when
they came under heavy load and were slow to respond. We’ve adjusted
our “readiness” and “liveness” probes so that Kubernetes will be more patient
and more gentle
(Infra PR 665 from
Ryan Johnson).
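The probe timing changes themselves live in the infra repo. Conceptually, a liveness check should only confirm that the process can answer at all, while a readiness check may verify dependencies such as MySQL. A hypothetical sketch of the two Django views (endpoint names and checks are illustrative, not Kuma’s actual code):

```python
from django.db import connection
from django.http import HttpResponse, JsonResponse


def liveness(request):
    # Liveness: only asserts that the worker can serve a request at all.
    # If this fails repeatedly, Kubernetes restarts the pod.
    return HttpResponse("OK")


def readiness(request):
    # Readiness: checks a dependency (here, MySQL). If this fails, Kubernetes
    # stops routing traffic to the pod rather than restarting it, so a slow
    # database no longer forces a cascade of restarts.
    try:
        with connection.cursor() as cursor:
            cursor.execute("SELECT 1")
    except Exception:
        return JsonResponse({"database": "unavailable"}, status=503)
    return JsonResponse({"database": "ok"})
```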
These changes have made MDN more resilient and reliable, but more work will be
needed in December.
Stephanie Hobson fixed the development
favicon appearing in production
(PR 4530), as well as an issue
with lazy-loading web fonts
(PR 4533).
Ryan Johnson continues work on our deployment
process. Pushing certain branches will cause Jenkins to take specific
deployment steps. Pushing master will run tests and publish a Docker image.
Pushing stage-push will deploy that image to
stage.mdn.moz.works. Pushing
stage-integration-tests will run browser and HTTP tests against that
deployment. We’ll make these steps more reliable, add production variants, and
then link them together into automated deployment pipelines.
Many of the BCD pull requests merged in November came from external
contributors, including several first-time contributions. Here are some of the
highlights:
Update chrome_url_overrides for Opera
(BCD PR 559),
from first-time contributor
Zbyněk Eiselt.
Mark add and set methods of Set, Map and WeakMap objects as
partially implemented on IE 11
(BCD PR 586),
from first-time contributor
Ivan Buryak.
Add link to Edge bug report for <a>
(BCD PR 592),
from first-time contributor
Michael Hogg.
Update status for CSS Scroll Snapping properties
(BCD PR 609),
from first-time contributor
Masataka Yakura.
Update IANA timezone name support in Chrome/Opera
(BCD PR 611),
from first-time contributor
jungshik.
Update browser identifier declaration instructions
(BCD PR 625),
Fix schema documentation for flag
(BCD PR 627), and
Add example for Status information
(BCD PR 628),
from first-time contributor
Ra’Shaun Stovall.
Update support data for parseInt treatment of leading zeros
(BCD PR 633),
from first-time contributor
Claude Pache.
Update textarea@autocomplete compat data
(BCD PR 637 and
PR 673),
from first-time contributor
Matt N.
Add sampleRate option to new AudioContext()
(BCD PR 651),
from first-time contributor
Jedipedia.
Add support for page_action for FF for Android
(BCD PR 667),
from first-time contributor
Elad.
Safari has implemented upgrade-insecure-requests
(BCD PR 670),
from first-time contributor
Justyn Temme.
Mozilla gathers for the
All-Hands event in Austin, TX in
December, which gives us a chance to get together, celebrate the year’s
accomplishments, and plan for 2018. Mozilla offices will shut down for the
last full week of December. This doesn’t leave a lot of time for coding.
We’ll continue working on the projects we worked on in November. We’ll convert
more Browser Compatibility data. We’ll tweak the AWS infrastructure. We’ll
eliminate and convert more waffle flags. We’ll watch the interactive examples
and improved compatibility tables, and ship them when ready.
We’ll also take a step back, and ask if we’re spending time and attention on
the most important things. We’ll think about our processes, and how they could
better support our priorities.
But mostly, we’ll try not to mess things up, so that we can enjoy the holidays
with friends and family, and come back refreshed for 2018.
Here’s what happened on the MozMEAO SRE team from November 7th - November 14th.
Current work
Firefox Quantum release
The team actively monitored our bedrock Kubernetes deployments during the release of Firefox Quantum (https://www.mozilla.org/en-US/firefox/). No manual intervention was required during the release.
SRE General
To step up our efforts on the security front, we’ve updated all of our application Docker images to build from a small set of recommended base images.
SUMO
An Elastic.co Elasticsearch development instance has been provisioned and is usable by the SUMO development team.
MDN
We’re now varying the CloudFront cache on the querystring parameter revision, which is used to refresh embedded live samples when the document is updated.
On October 10, we moved MDN from Mozilla’s SCL3 datacenter to a
Kubernetes cluster in
the AWS us-west-2 (Oregon) region. The database move went well, but we
needed five times the web resources that our maintenance-mode tests had
suggested. We were able to scale up smoothly within the four hours we had
budgeted for the migration.
Dave Parfitt and
Ryan Johnson did a great job implementing
a flexible set of deployment tools and monitors that allowed us to quickly
react to and handle the unexpected load.
The extra load was caused by mdn.mozillademos.org, which serves
user uploads and
wiki-based code samples.
These untrusted resources are served from a different domain so that browsers
will protect MDN users from the worst security issues. I excluded these
resources from the production traffic tests, which turned out to be a mistake,
since they represented 75% of the web traffic load after the move.
Ryan and I worked to get this domain behind a CDN. This included avoiding a
Vary: Cookie header that was being added to all responses
(PR 4469), and adding
caching headers to each endpoint
(PR 4462 and
PR 4476).
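Those PRs aren’t shown here, but the general shape in Django is to mark each endpoint’s responses as publicly cacheable and to avoid per-user Vary headers; an illustrative sketch:

```python
from django.http import HttpResponse
from django.views.decorators.cache import cache_control


@cache_control(public=True, max_age=60 * 60 * 24)  # illustrative: one day
def code_sample(request, sample_id):
    # Explicit public caching headers let CloudFront keep a copy close to the
    # user instead of asking Kuma every time. Adding `Vary: Cookie` here would
    # make every visitor's copy distinct and defeat the CDN.
    return HttpResponse("<p>rendered sample</p>", content_type="text/html")
```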
We added CloudFront
to the domain on October 26. Now most of these resources are served from
the CloudFront CDN, which is fast and often closer to the MDN user (for
example, served to French users from a server in Europe rather than
California). Over a week, 197 GB was served from the CDN, versus 3 GB (1.5%)
served from Kuma.
Load on Kuma is reduced as well. The CDN handles many requests on its own, so
Kuma never sees them. For the rest, the CDN periodically checks with Kuma that
content hasn’t changed, which often requires only a short 304 Not Modified
response rather than the full content.
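That revalidation uses ordinary conditional requests. In Django the condition decorator handles them: it computes a validator such as Last-Modified and answers with a bare 304 Not Modified when it still matches. A sketch with an illustrative in-memory store standing in for the real models:

```python
import datetime

from django.http import HttpResponse
from django.views.decorators.http import condition

# Illustrative stand-in for the real storage layer.
SAMPLES = {"demo": ("<p>example</p>", datetime.datetime(2017, 11, 1))}


def sample_last_modified(request, sample_id):
    return SAMPLES[sample_id][1]


@condition(last_modified_func=sample_last_modified)
def live_sample(request, sample_id):
    # The body is built only when the CDN's copy is stale; if Last-Modified
    # still matches, Django replies with a short 304 Not Modified instead.
    return HttpResponse(SAMPLES[sample_id][0])
```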
Backend requests for attachments have dropped by 45%:
Code samples requests have dropped by 96%:
We continue to use a CDN for our static assets, but not for
developer.mozilla.org itself. We’d have to do similar work to add caching
headers, ideally splitting anonymous content from logged-in content. The
untrusted domain had 4 endpoints to consider, while developer.mozilla.org
has 35 to 50. We hope to do this work in 2018.
Continued Migration of Browser Compatibility Data
The Browser Compatibility Data
project was the most active MDN project in October. Another 700 MDN pages use
the BCD data, bringing us up to 2200 MDN pages, or 35.5% of the pages with
compatibility data.
Daniel D. Beck continues migrating the CSS data,
which will take at least the rest of 2017.
wbamberg continues to update WebExtension and
API data, which needs to keep up with browser releases.
Chris Mills migrated the Web Audio data
with 32 PRs, starting with
PR 433. This data
includes mixin interfaces, and prompted some discussion about how to
represent them in BCD in
issue #472.
Florian Scholz added MDN URLs in
PR 344, which will help
BCD integrators to link back to MDN for more detailed information.
Browser names and versions are an important part of the compatibility data, and
Florian and Jean-Yves Perrier worked to
formalize their representation in BCD. This includes standardization of
the first version, preferring “33” to “33.0”
(PR 447 and more),
and fixing some invalid version numbers
(PR 449 and more).
In November, BCD will add more of this data, allowing automated validation
of version data, and enabling some alternate ways to present compat data.
Florian continues to release a new NPM package each Monday, and
enabled tag-based releases
(PR 565)
for the most recent 0.0.12 release.
mdn-browser-compat-data
had over 900 downloads last month.
Stephanie Hobson and Florian are
collaborating on a
new compat table design
for MDN, based on the BCD data.
The new format summarizes support across desktop and mobile browsers, while
still allowing developers to dive into the implementation details. We’ll ship
this to beta users on 2200 MDN pages in November. See
Beta Testing New Compatibility Tables
on Discourse for more details.
Improve Performance of MDN and the Interactive Editor
Page load times have increased with the move to AWS. We’re looking into ways
to increase performance across MDN. You can follow our
MDN Post-migration project
for more details.
We also want to enable the
interactive editor
for all users, but we’re concerned about further increasing page load times. You can
follow the remaining issues in the
interactive-examples repo.
Update Localization of KumaScript Macros
In August, we planned the toolkit we’d use to extract strings from
KumaScript macros (see
bug 1340342).
We put implementation on hold until after the AWS migration. In November,
we’ll dust off the plans and get some sample macros converted. We’re hopeful
the community will make short work of the rest of the macros.
MDN in AWS
The AWS migration project started in
November 2014,
bug 1110799.
The original plan was to switch by summer 2015, but the technical and
organizational hurdles proved harder than expected. At the same time, the team
removed many legacy barriers making Kuma hard to migrate. A highlight of the
effort was the Mozilla All Hands in December 2015, where the team merged
several branches of work-in-progress code
to get Kuma running in Heroku.
Thanks to
Jannis Leidel,
Rob Hudson,
Luke Crouch,
Lonnen,
Will Kahn-Greene,
David Walsh,
James Bennet,
cyliang,
Jake,
Sean Rich,
Travis Blow,
Sheeri Cabral,
and everyone else who worked on or influenced this first phase of the project.
The migration project
rebooted in Summer 2016.
We switched to targeting Mozilla Marketing’s deployment environment. I split
the work into smaller steps leading up to AWS. I thought each step would take
about a month. They took about 3 months each. Estimating is hard.
Changes to MDN Services
MDN no longer uses Apache to serve files and proxy Kuma.
Instead, Kuma serves requests directly with
gunicorn with the
meinheld worker. I did
some analysis
in January, and
Dave Parfitt and
Ryan Johnson led the effort to port Apache
features to Kuma:
Static assets (CSS, JavaScript, etc.) are served directly with
WhiteNoise.
Kuma handles the domain-based differences between the main website
and the untrusted domain.
Miscellaneous files like robots.txt, sitemaps, and legacy files (from the
early days of MDN) are served directly.
Kuma adds security headers to responses.
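The real settings live in Kuma and its Docker image; a minimal, illustrative gunicorn configuration for this kind of setup might look like the following (the file name and values are assumptions, not Kuma’s actual configuration):

```python
# gunicorn.conf.py (sketch)
bind = "0.0.0.0:8000"                               # port the Service targets
workers = 4                                         # per-pod worker processes
worker_class = "meinheld.gmeinheld.MeinheldWorker"  # meinheld's gunicorn worker
```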
Another big change is how the services are run. The base unit of implementation
in SCL3 was the multi-purpose virtual machine (VM). In AWS, we are switching to
application-specific Docker
containers.
In SCL3, the VMs were split into 6 user-facing web servers and 4 backend
Celery servers.
In AWS, the EC2 servers act as Docker hosts. Docker uses
operating system virtualization,
which has several advantages over machine virtualization for our use cases.
The Docker images are distributed over the EC2 servers, as chosen by Kubernetes.
The SCL3 servers were maintained as long-running servers, using
Puppet to install
security updates and new software. The servers were multi-purpose, used
for Kuma, KumaScript, and backend Celery processes. With Docker, we
instead use a Python/Kuma image and a
node.js/KumaScript image to
implement MDN.
The Python/Kuma image is configurable through
environment variables to run in different
domains (such as staging or production), and to be configured as one of our
three main Python services:
web - User-facing Kuma service
celery - Backend Kuma processes outside of the request loop
api - A backend Kuma service, used by KumaScript to render pages. This
avoids an issue in SCL3 where KumaScript API calls were competing with MDN
user requests.
Our node.js/KumaScript service is also configured via environment variables,
and implements the fourth main service of MDN:
kumascript - The node.js
service that renders wiki pages
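A hypothetical sketch of the pattern, where one image picks its role from an environment variable at startup (the variable names, commands, and module paths are illustrative, not the actual Kuma entrypoint):

```python
import os

# One Docker image, several roles: the entrypoint reads an environment
# variable and execs the matching command.
ROLE = os.environ.get("SERVICE_ROLE", "web")

COMMANDS = {
    "web": ["gunicorn", "kuma.wsgi:application"],   # user-facing service
    "api": ["gunicorn", "kuma.wsgi:application"],   # backend for KumaScript
    "celery": ["celery", "-A", "kuma", "worker"],   # background task runner
}

if __name__ == "__main__":
    command = COMMANDS[ROLE]
    os.execvp(command[0], command)  # replace this process with the service
```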
Building the Docker images involves installing system software, installing the
latest code, creating the static files, compiling translations, and preparing
other run-time assets. AWS deployments are the relatively fast process of
switching to newer Docker images. This is an improvement over SCL3, which
required doing most of the work during deployment while developers watched.
An Introduction to Kubernetes
Kubernetes
is a system for automating the deployment, scaling, and management of
containerized applications.
A big part of understanding Kubernetes is learning the vocabulary.
Kubernetes Concepts is a good place
to start. Here’s how some of these concepts are implemented for MDN:
Ten EC2 instances in AWS are configured as
Nodes, and
joined into a Kubernetes Cluster. Our “Portland Cluster” is in the
us-west-2 (Oregon) AWS region. Nine Nodes are available for application
usage, and the master Node runs the Cluster.
The mdn-prod Namespace
collects the resources that need to collaborate to make MDN work. The
mdn-stage Namespace is also in the Portland Cluster, alongside other
Mozilla projects.
A Service
defines a service provided by an application at a TCP port. For example,
a webserver provides an HTTP service on port 80.
The web service is connected to the outside world via an AWS
Elastic Load Balancer (ELB), which can reach it at
https://developer.mozilla.org (the main site) and
https://mdn.mozillademos.org (the untrusted resources).
The api and kumascript services are available inside the cluster,
but not routed to the outside world.
celery doesn’t accept HTTP requests, and so it doesn’t get a Service.
The application that provides a service is defined by a
Deployment,
which declares what Docker image and tag will be used, how many replicas are
desired, the CPU and memory budget, what disk volumes should be mounted, and
what the environment configuration should be.
A Kubernetes Deployment is a higher-level object, implemented with a
ReplicaSet,
which then starts up several
Pods
to meet the demands. ReplicaSets are named after the Service plus a random
number, such as web-61720, and the Pods are named after the ReplicaSets
plus a random string, like web-61720-s7l.
ReplicaSets and Pods come into play when new software is rolled out. The
Deployment creates a new ReplicaSet for the desired state, and creates new Pods
to implement it, while it destroys the Pods in the old ReplicaSet. This
rolling deployment ensures that the application is fully available while new
code and configurations are deployed. If something is wrong with the new code
that makes the application crash immediately, the deployment is cancelled. If
it goes well, the old ReplicaSet is kept around, making it easier to roll back
if subtler bugs turn up.
This deployment style puts the burden on the developer to ensure that the two
versions can run at the same time. Caution is needed around database changes
and some interface changes. In exchange, deployments are smooth and safe with no
downtime. Most of the setup work is done when the Docker images are created,
so deployments take about a minute from start to finish.
Kubernetes takes control of deploying the application and ensures it keeps
running. It allocates Pods to Nodes (called Scheduling), based on the CPU
and memory budget for the Pod, and the existing load on each Node. If a Pod
terminates, due to an error or other cause, it will be restarted or recreated.
If a Node fails, replacement Pods will be created on surviving Nodes.
The Kubernetes system allows several ways to scale the application. We used
some for handling the unexpected load of the user attachments:
We went from 10 to 11 Nodes, to increase the total capacity of the Cluster.
We scaled the web Deployment from 6 to 20 Pods, to handle more
simultaneous connections, including the slow file requests.
We scaled the celery Deployment from 6 to 10 Pods, to handle the load of
populating the cold cache.
We adjusted the gunicorn worker threads from 4 to 8, to increase
the number of simultaneous connections.
We rolled out new code to improve caching.
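We used standard Kubernetes tooling for these adjustments. Purely as an illustration (not the commands we actually ran), the web Deployment could be resized with the Kubernetes Python client like this:

```python
from kubernetes import client, config

# Illustrative sketch: resize the "web" Deployment in the mdn-prod Namespace
# to 20 replicas; Kubernetes then schedules the new Pods across the Nodes.
config.load_kube_config()  # use the operator's local kubeconfig
apps = client.AppsV1Api()
apps.patch_namespaced_deployment_scale(
    name="web",
    namespace="mdn-prod",
    body={"spec": {"replicas": 20}},
)
```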
There are many more details, which you can explore by reading our
configuration files
in the infra repo. We use
Jinja for our templates, which we find more
readable than the Go templates
used by many Kubernetes projects. We’ll continue to refine these as we
adjust and improve our infrastructure. You can see our current tasks by
following the
MDN Post-migration project.