The Snippets Service allows
Mozilla to communicate with Firefox users directly by placing a snippet of text
and an image on their new tab page. Snippets share exciting news from the
Mozilla world, useful tips and tricks based on user activity, and sometimes
jokes.
To achieve personalized, activity-based messaging in a privacy-respecting and
efficient manner, the service creates a Bundle of Snippets per locale. Bundles
are HTML documents that contain all Snippets targeted to a group of users,
including their stylesheets, images, metadata, and the JS decision engine.
The Bundle is transferred to the client, where the locally executed decision
engine selects a snippet to display. A carefully designed system with multiple
levels of caching takes care of the delivery. One layer of caching is a
CloudFront CDN.
The problem
Over the last few months we observed a significant uptick in our CDN costs as
Mozilla’s Lifecycle Marketing Team increased the number of Snippets for the
English language from about 10 to 150.
The Bundle file size increased from about 200 KiB to more than 4 MiB. Given that
Firefox requests new Bundles every 4 hours, that translated to about 75 TB of
transferred data per day, or about 2.25 PB (yes, that’s petabytes!) of data
transferred per month, despite the local browser caching.
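As a rough back-of-envelope illustration (my own arithmetic from the round numbers above, not service metrics), the daily figure corresponds to tens of millions of Bundle downloads:

```python
# Illustrative back-of-envelope math using the figures quoted above.
bundle_size_bytes = 4 * 1024 ** 2      # a >4 MiB Bundle, rounded to 4 MiB
daily_transfer_bytes = 75 * 10 ** 12   # ~75 TB transferred per day

downloads_per_day = daily_transfer_bytes / bundle_size_bytes
print(f"~{downloads_per_day / 1e6:.0f} million Bundle downloads per day")
# Each Firefox profile asks for a fresh Bundle every 4 hours (up to 6 times a
# day), so even with caching this is a lot of repeated transfer of the same bytes.
```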
The solution
Bundles include everything a Snippet needs to be displayed: the targeting rules,
the text, and the image in a base64-encoded format. Our first hypothesis was that
we could reduce the Bundle size by shrinking the images. We ran
optipng against all images in the Bundle to
test the hypothesis. The images were optimized, but the Bundle shrank by only
100 KiB, about 2.5% of the total size.
Our second hypothesis was to replace the images with links to images. Since not
all Snippets are displayed to all users, we benefit by not transferring all
images to all users. This reduced the Bundle size to 1.1 MiB, without accounting
for the size of the images that would still be transferred separately.
Our third hypothesis was to replace gzip with Brotli compression.
Brotli is a modern compression algorithm
supported by Firefox and all other major browsers as an alternative method for
HTTP compression.
Brotli reduced the Bundle down to about 500 KiB, roughly 25% of the size achieved
by CloudFront’s gzip mechanism, which compressed the Bundle to about 2.2 MiB.
Since CloudFront does not support on-the-fly Brotli compression, we prepare and
compress the Bundles at the app level before uploading them to S3. By adding the
correct Content-Encoding header, the S3 objects are ready to be served by the
CDN.
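A minimal sketch of that pre-compression step, assuming the brotli and boto3 packages and a hypothetical bucket and key name (the service’s actual upload code may differ):

```python
import brotli
import boto3


def upload_bundle(html: str,
                  bucket: str = "snippets-bundles",       # hypothetical bucket
                  key: str = "bundles/en-us.html") -> None:
    """Compress a Bundle with Brotli and upload it to S3, ready to serve via the CDN."""
    compressed = brotli.compress(html.encode("utf-8"), quality=11)

    s3 = boto3.client("s3")
    s3.put_object(
        Bucket=bucket,
        Key=key,
        Body=compressed,
        ContentType="text/html; charset=utf-8",
        # Tell CloudFront and browsers the object is already Brotli-compressed.
        ContentEncoding="br",
    )
```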
Conclusions
Although all three solutions can reduce the Bundle size, the third provided the
best performance-to-effort ratio, and we proceeded with its implementation. The
next day’s reports graphed a significant drop in costs, marking the project a
success: from the original average of 75 TB of transferred data per day, we
dropped down to 15 TB. We plan to improve further in the future by moving the
images outside the Bundle.
It’s clear that Brotli compression can achieve significantly higher compression
ratios than gzip at the expense of more CPU time. Even though our CDN of
choice doesn’t support Brotli, assets can be pre-compressed and uploaded ready
for use.
On its face, www.mozilla.org doesn’t look like it would be a complex application to write, maintain, or run.
But when you throw over 100 million unique visitors per week at any site, things can get complicated quickly. Add to that translations
of the content into over 100 languages and you start to get an idea of where it might get interesting. So we take every
opportunity we can get to simplify and reduce hosting complexity and cost. This is where the idea to
switch to SQLite for our production database needs was born.
The traditional answer to the question “should we use SQLite for our web application in production?” is an emphatic NO. But,
again, bedrock is different. It uses its database as a read-only data store as far as the web application is concerned. We run a
single data updater process (per cluster) that writes the updates to the DB server that all of the app instances use.
Most of bedrock is static content coded directly into templates, but we use the database to store things like product release
notes, security advisories, blog posts, Twitter feeds, and the like; basically anything that needs updating more often than
we deploy the site. SQLite is indeed a bad solution for a typical web application that writes and reads data in its
normal operation, because SQLite rightly locks itself to a single writer at a time, and a web app with any traffic almost certainly
needs to write more than one thing at a time. But when you only need to read data, SQLite is an incredibly fast and robust
solution for data storage.
Data Updates
The trick with a SQLite store is refreshing the data. We still need to update all those bits of data I mentioned before. Our
solution is to keep the aforementioned single process updating the data, but this time it updates a local SQLite file,
calculates a hash of said file, and uploads the database and its metadata (a JSON file that includes the SHA256 hash) to AWS S3.
The Docker containers for the web app also run a separate process that checks for a new database file on a schedule
(every 5 minutes or so), compares its metadata to the one currently in use, downloads the newer database, checks its hash against
the one from the metadata to ensure a successful download, and swaps it with the old file atomically, with the web app none the
wiser. Using Python’s os.rename function to swap the database file ensures an atomic switch with zero errors due to a missing DB
file. We thought about using symlinks for this, but it turns out to be harder to re-point a symlink than to just do the rename,
which atomically overwrites the old file with the new (I’m pretty sure it’s actually just updating the inode to which the name
points, but I’ve not verified that).
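A minimal sketch of that download-verify-swap loop, with hypothetical file names and URLs standing in for bedrock’s actual ones (see the real source for details):

```python
import hashlib
import json
import os
import urllib.request

DB_PATH = "/app/data/bedrock.db"                               # hypothetical live DB path
BASE_URL = "https://example-bucket.s3.amazonaws.com/bedrock"   # hypothetical bucket URL


def refresh_db() -> bool:
    """Download a newer database if one is available and swap it in atomically."""
    with urllib.request.urlopen(f"{BASE_URL}/latest.json") as resp:
        meta = json.load(resp)

    # Skip the download if we already run the database described by the metadata.
    if os.path.exists(DB_PATH + ".json"):
        with open(DB_PATH + ".json") as f:
            if json.load(f)["sha256"] == meta["sha256"]:
                return False

    tmp_path = DB_PATH + ".new"
    urllib.request.urlretrieve(f"{BASE_URL}/{meta['filename']}", tmp_path)

    # Verify the download against the hash published in the metadata.
    with open(tmp_path, "rb") as f:
        if hashlib.sha256(f.read()).hexdigest() != meta["sha256"]:
            os.remove(tmp_path)
            raise ValueError("downloaded database failed checksum verification")

    # os.rename is atomic on the same filesystem: readers see either the old
    # file or the new one, never a missing database.
    os.rename(tmp_path, DB_PATH)
    with open(DB_PATH + ".json", "w") as f:
        json.dump(meta, f)
    return True
```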
When all of this is working, it means that bedrock no longer requires a database server. We can turn off our AWS RDS instances and never
have to worry about DB server maintenance or downtime. The site isn’t all that much faster since, as I said, it’s mostly spending its
time rendering Jinja templates, but it is a lot cheaper to run and less likely to go down. We are also making DB schema changes easier
and less error-prone, since the DB filenames include the git hash of the version of bedrock that created them. This means that the
production Docker images contain an updated and migrated database file, and a running instance will only download an update once the
same version of the site is the one producing database files.
And production advantages aren’t the only win: we also have a much simpler development bootstrap process now, since getting all of
the data you need to run the full site is a simple matter of either running bin/run-db-download.py or pulling the prod Docker image
(mozorg/bedrock:latest), which contains a reasonably up-to-date database and the machinery to keep it updated. None of this requires
AWS credentials since the database is publicly available.
Verifying Updates
Along with actually performing the updates in every running instance of the site, we also need to be able to monitor that said updates
are actually happening. To this end we created a page on the site that gives us some
data: the last time that instance ran the update, the git hash of bedrock that is currently running, the git hash used to
create the database in use, and how long ago said database was updated. This page will also respond with a 500 code instead of the
normal 200 if the DB and L10n update tasks happened too long ago. At the time of writing, the updates happen every 5 minutes, and the
page starts to fail after 10 minutes of no updates. Since the updates and the site are running in separate processes in the Docker
container, we need a way for the cron process to communicate to the web server the time of the last run for these tasks. For this we
decided on files in /tmp that the cron jobs simply touch, and the web server can read the mtime (check out
the source code for details).
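Conceptually it could look something like the following sketch; the marker file name and failure threshold are my own stand-ins, not bedrock’s actual values:

```python
import time
from pathlib import Path

from django.http import JsonResponse

LAST_RUN_FILE = Path("/tmp/db-update-last-run")  # hypothetical marker file
MAX_AGE_SECONDS = 10 * 60                        # fail after 10 minutes without an update


def record_db_update():
    """Called at the end of the cron task: just touch the marker file."""
    LAST_RUN_FILE.touch()


def health_view(request):
    """Report how long ago the updater last ran; respond 500 if it was too long ago."""
    try:
        age = time.time() - LAST_RUN_FILE.stat().st_mtime
    except FileNotFoundError:
        age = None

    healthy = age is not None and age < MAX_AGE_SECONDS
    return JsonResponse(
        {"seconds_since_db_update": age, "healthy": healthy},
        status=200 if healthy else 500,
    )
```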
To actually monitor this view we are starting with simply using New Relic Synthetics pings of this URL at each of our clusters
(currently oregon-b, tokyo, and frankfurt). This is a bit suboptimal because it will only be checking whichever pod happens to respond
to that particular request. In the near future our plan is to move to creating another process type for bedrock
that will query Kubernetes for all of the running pods in the cluster and ping each of them on a schedule. We’ll then ping
Dead Man’s Snitch (DMS) on every fully successful round of checks, and if they fail more than a
couple of times in a cluster we’ll be notified. This will mean that bedrock will be able to monitor itself for data update troubles.
We also ping DMS on every database update run, so we should know quickly if either database uploading or downloading is having trouble.
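That pod-pinging process doesn’t exist yet; a rough sketch of the idea, assuming the official kubernetes Python client and requests, with a hypothetical namespace, label selector, port, and health path, might look like this:

```python
import requests
from kubernetes import client, config


def ping_all_pods(namespace: str = "bedrock-prod") -> bool:
    """Hit the health page on every bedrock pod; return True only if all pods pass."""
    config.load_incluster_config()  # we'd run this inside the cluster
    v1 = client.CoreV1Api()
    pods = v1.list_namespaced_pod(namespace, label_selector="app=bedrock")

    all_ok = True
    for pod in pods.items:
        url = f"http://{pod.status.pod_ip}:8000/healthz/"  # hypothetical port and path
        try:
            ok = requests.get(url, timeout=5).status_code == 200
        except requests.RequestException:
            ok = False
        all_ok = all_ok and ok
    return all_ok


# If every pod reports healthy, ping Dead Man's Snitch so we get alerted on silence:
# if ping_all_pods():
#     requests.get("https://nosnch.in/<snitch-id>", timeout=5)
```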
Conclusions
We obviously don’t yet know the long-term effects and consequences of this change (as of writing it’s been in production less than a day),
but for now our operational complexity and costs are lower. I feel confident calling it a win for our deployment reliability for now.
Bedrock may eventually move toward having a large part of it pre-generated and hosted statically, but for now this version feels like the
one that will be as robust, resilient, and reliable as possible while still being one big Django web application.
In February, we asked the MDN community to help convert compatibility data
to the browser-compat-data
repository. Florian Scholz led this effort,
starting with a conference talk and blog post
last month. He created
GitHub issues
to suggest migration tasks, and
added a
call to action
on the old pages.
The response from the community has been overwhelming. There were 203 PRs
merged in February, and 96 were from 23 first-time contributors. Existing
contributors such as Mark Boas,
Chris Mills, and
wbamberg kept up their January pace.
The PRs were reviewed for the correctness of the conversion as well as
to ensure the data was up to date; Florian,
Jean-Yves Perrier, and
Joe Medley have done the most reviews.
In February, the project jumped from 43% to 57% of the data converted,
and the data is better than ever.
There are two new tools using the data.
SphinxKnight is working on
compat-tester, which scans
an HTML, CSS, or JavaScript file for compatibility issues with a user-defined
set of browsers. K3N is working on
mdncomp, which displays compatibility
data on the command line.
If you have a project using the data, let us know about it!
We continue to improve and expand the interactive examples, such as a
clip-path
demo from Rachel Andrew.
We’re expanding the framework to allow for HTML examples, which often
need a mix of HTML and CSS to be interesting. Like previous efforts,
we’re using
user testing
to develop this feature. We show the work-in-progress, like the
<table> demo,
to an individual developer, watch how the demo is used and ask for feedback,
and then iterate on the design and implementation.
The demos have gone well, and the team will firm up the implementation and
write more examples to prepare for production. The team will also work on
expanding test coverage and formalizing the build tools in a new package.
We made
many changes last month
to improve the performance and reliability of MDN. They worked, and we’ve
entered a new period of calm. We’ve had a month without 3 AM downtime or
performance alerts, for the first time since the move to AWS. The site is
responding more smoothly, and easily handling MDN’s traffic.
This has freed us to focus on longer term fixes and on the goals for the
quarter. One of those is to serve MDN from behind a CDN, which will further
reduce server load and may have a huge impact on response time.
Ryan Johnson is getting the code ready.
He switched to Django’s middleware for handling ETag creation
(PR 4647), which allowed him to
remove some buggy caching code
(PR 4648). Ryan is now working
through the many endpoints, adding caching headers and cleaning up tests
(PR 4676,
PR 4677, and others). Once this
work is done, we’ll add the CDN that will cache content based on the directives
in the headers.
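The post doesn’t show the code, but the general pattern is standard Django; a hedged sketch of conditional-GET middleware plus explicit cache headers on a view might look like this (the middleware choice and cache lifetime are illustrative, not necessarily what Kuma uses):

```python
# settings.py: let Django compute ETags and answer If-None-Match requests with 304s.
MIDDLEWARE = [
    "django.middleware.http.ConditionalGetMiddleware",
    # ... the rest of the middleware stack ...
]

# views.py: declare how long shared caches (such as a CDN) may keep a response.
from django.http import HttpResponse
from django.views.decorators.cache import cache_control


@cache_control(public=True, max_age=60 * 5)  # 5 minutes; illustrative value
def document(request, slug):
    return HttpResponse(f"rendered document for {slug}")
```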
My focus has been on the
Django 1.11 upgrade,
since
Django 1.8 is scheduled to lose support in April.
This requires updating third-party libraries like django-tidings
(PR 4660) and
djangorestframework
(PR 4664
from Safwan Rahman). We’re moving away
from other requirements, such as dropping dbgettext
(PR 4669). We’ve taken care of
the most obvious upgrades, but there are 142,000 lines of Python in our
libraries, so we expect more surprises as we get closer to the switch.
Once the libraries are compatible with Django 1.11, the remaining issues
will be with the Kuma codebase. Some changes are small and easy,
such as a one-liner in PR 4684.
Some will be quite large. Our code that serves up locale-specific content,
such as
reverse
and
LocaleURLMiddleware,
is incompatible, and we’ll have to swap some of our
oldest code
for
Django’s version.
- Adds compat data for AnimationEffectTiming (PR 1000), Adds compat data for AnimationEffectTimingReadOnly (PR 1001), and 6 more PRs to BCD from Benny Powers.
- Adding border radius example with recommended changes (PR 546), Adding list-style css example (PR 547), and Adding supporting samples for border corners to support request #502 (PR 553), to Interactive Examples from Helmut Granda.
- Add example for list-style-type CSS property (PR 594), Add example for list-style-image CSS property (PR 600), and 9 more PRs from Daniel D. Beck (first contributions to Interactive Examples).
- Browser Kompatibilität -> Browserkompatibilität (PR 570), add german translation (PR 617), and add german translation (PR 618), to KumaScript from schlagi123.
- Add WebAuthn spec (PR 574), and Add Web Authentication API (PR 615), to KumaScript from Adam Powers.
- Update Japanese translation (PR 590), Added Japanese translation (PR 591), and Added Japanese translations for CompatTable (PR 593), to KumaScript from Masahiro Fujimoto.
- bug 957802: Add Code of Conduct, Update Contributing doc (Kuma PR 4674), from me.
- Fix bug 1434558, swap search and toolbox components (Kuma PR 4656), from Schalk Neethling.
Check that your profile image still works with round photos!
Starting March 2, the MDN developers move from Marketing to Emerging
Technologies. We’ll be working on the details of this transition in March and
the coming months. That will include planning an infrastructure transition and
finding a new home for the MDN Changelog.
Josh Mize led the effort to integrate MDN into
the marketing technology and processes. He helped with our move to
Docker-based development and deployment, implemented
demo deploys, advocated for
a read-only and statically-generated deployment, and worked out details of
the go-to-AWS strategy, such as file serving and the master database
transfer. Josh keeps up to date on the infrastructure community, and knows
what tech is reliable, what the community is excited about, and what the
next best practices will be.
Paul McLanahan helped review PRs when we had a
single backend developer. His experience migrating
bedrock to AWS was invaluable, and
his battle-tested
django-redirect-urls made it
possible to migrate away from Apache and get 10 years of redirects under
control.
Ben Sternthal made the transition
into Marketing possible. He made us feel welcome from day one, hired
some amazing contractors
to help with the dark days of the
2016 spam attack, hired
Ryan Johnson, and
worked to secure the resources and support to move to AWS. He created a space where
developers could talk about what is important to us, where we spent time
and effort on technical improvements and career advancement, and where
technical excellence was balanced with features and experiments.
MDN is on a firmer foundation after the time spent in MozMEAO, and is ready
for the next chapter in its 13-year history.
Ryan Johnson, Schalk Neethling, and I will join the Advanced Development team
in Emerging Technologies, reporting to
Faramarz Rashed. The Advanced Development team
has been working on various ET projects, most recently
Project Things,
an Internet of Things (IoT) project that is focused on decentralization,
security, privacy, and interoperability. It’s a team that is focused on
getting fresh technology into users’ hands. This is a great environment for
the next phase of MDN, as we build on the more stable foundation and expand our
reach.
We’re traveling to the Mozilla Paris Office
in March. We’ll have team meetings on Tuesday, March 13 through Thursday,
March 15, to plan for the next three months and beyond.
From Friday, March 16 through Sunday, March 18, we’ll have the third
Hack on MDN event. The last one
was in
2015 in Berlin,
and the team is excited to return to this format. The focus of the Paris
event will be the
Browser Compat Data project.
We expect to build some tools using the data, create alternative displays of
compat information, and improve the migration and review processes.
One of our goals for the year is to improve page load times on MDN. We’re
building on a similar SEO project from last year, and are looking for an external
expert to measure MDN’s performance and recommend next steps. Take a look at
our
Request for Proposal.
We plan to select the top bidders by March 30, 2018.
Here’s what happened on the MozMEAO SRE team from February 16th to February 28th.
Current work
support.mozilla.org (SUMO)
Most of our recent efforts have been related to the SUMO migration to AWS. We’ll be running the stage and production environments in our Oregon-A and Oregon-B clusters, with read-only failover in Frankfurt.
We’ve provisioned a MySQL instance in Oregon, and production data is currently being replicated to this instance from SUMO’s current home in the SCL3 datacenter.
As you may have seen in several of our
SRE status reports,
we’re moving all of our webapp hosting from Deis to Kubernetes (k8s). As part of that,
we’ve also been doing some additional thinking about the security of our deployments.
One thing we’ve not done as good a job with as we should is Django’s ALLOWED_HOSTS setting.
We should have been adding all possible hosts to that list, but it seems we used to occasionally
leave it set to ['*']. This isn’t great, but it also isn’t the end of the world, since
we don’t knowingly construct URLs using the info sent via the Host header. In an effort
to cover all bases, we’ve decided to improve this. Unfortunately, our particular combination
of technologies doesn’t make this as
easy as we thought it would be (story of our lives).
AWS ELB Health Checks
Here’s the thing: Amazon Web Services’ (AWS) Elastic Load Balancers (ELB) do not have many configuration options for
their health checks. These checks ensure that your app on a particular node in your cluster
is working as expected. If the check fails, the ELB will remove the node from the list of nodes
to which it will route requests for your app. However, because it hits the nodes directly,
it doesn’t rely on DNS; it requests the IP address and port directly, and it doesn’t allow you
to specify custom headers (e.g. the Host header). It also can’t do HTTPS, because we terminate
TLS connections at the ELB, so the app nodes speak only plain HTTP back to the ELB. All of that
means that our health check endpoint needs to do two unique things: allow HTTP connections and
allow the IP address that the ELB requests as a valid Host header. The first bit is easy enough
when using Django’s built-in SecurityMiddleware, since it supports the SECURE_REDIRECT_EXEMPT
setting. It’s the second requirement that gets more interesting when combined with k8s.
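For the first requirement, the relevant settings look roughly like this; the health-check path is a hypothetical example, not necessarily the one our apps use:

```python
# settings.py
SECURE_SSL_REDIRECT = True   # normally redirect all plain-HTTP requests to HTTPS

# Regex patterns exempt from the HTTPS redirect, so the ELB can health-check
# the app over plain HTTP. The exact path is illustrative.
SECURE_REDIRECT_EXEMPT = [r"^healthz/?$"]
```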
K8s Routing
The way I understand it (and I’m admittedly no expert) is that k8s (at least the way we use it)
sets up a NodePort per app (or namespace). To hit that app you can hit any node in the cluster
at that port and that node will route you to one of the nodes that is running a pod
for that app. The important bit for us is that the node that serves this request is not
necessarily the one that the ELB sent it to. So the Host header may contain an IP address for the
node that was initially hit, but not necessarily for the node that serves the request. This means that we can’t
simply add the IP of the host to the ALLOWED_HOSTS list when the app starts. We could get more info
from AWS’ metadata service endpoint, but for security reasons we block that service from all of our
nodes.
So, the approach could then be to simply add all of the IPs for all of the nodes in the cluster to the
ALLOWED_HOSTS setting and call it done. The problem with this happens when there is a scaling event.
When a node is killed and a new one started, or the cluster is scaled to include more nodes, you’d need
to have a way to inform every running pod of this change so they could get the new list of IPs. If they
didn’t update the list, the new node(s) could be immediately excluded from the cluster, because health checks would
return 400s since their IP (host) would not be allowed by Django.
Enter django-allow-cidr
The way we decided to solve this was by implementing a Django middleware that would allow a range of IP
addresses defined by a CIDR (Classless Inter-Domain Routing). We’ve released this middleware in a
Django package called django-allow-cidr. The way it works is to store the normal hosts you’ve set
in your ALLOWED_HOSTS setting, change that setting to ['*'] in order to bypass Django’s default
host header checking in the HttpRequest.get_host() method, and do the checking itself.
It does this checking via the same methods as Django would have, but if those methods fail it does
a secondary check using the IP ranges you’ve defined in an ALLOWED_CIDR_NETS setting. It creates
netaddr.IPNetwork instances from the CIDRs in that list and will check any host that isn’t valid
based on your original ALLOWED_HOSTS setting. Failing both of those checks will result in an
immediate return of a 400 response.
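Using it is a couple of lines of configuration; the hostnames and CIDR below are just examples of a site and its cluster’s internal network range, not our actual values:

```python
# settings.py
ALLOWED_HOSTS = ["www.example.com", "example.com"]  # the normal public hostnames

# Any Host header that is an IP inside these networks is also accepted,
# which covers ELB health checks hitting nodes directly by IP.
ALLOWED_CIDR_NETS = ["10.0.0.0/8"]                  # example cluster network

MIDDLEWARE = [
    "allow_cidr.middleware.AllowCIDRMiddleware",    # placed early in the stack
    # ... the rest of the middleware stack ...
]
```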
Conclusion
That was a long way to go to get to some simple health checking, but we believe it was the right move for the
reliability and security of our Django apps hosted in our k8s infrastructure on AWS. Please check out the
repo for django-allow-cidr on GitHub if you’re interested in the code. Our hope is that releasing this as
a general-use package will help others that find themselves in our situation, as well as help us
do less copypasta coding around our various web projects.