The Snippets Service allows Mozilla to communicate with Firefox users directly by placing a snippet of text and an image on their new tab page. Snippets share exciting news from the Mozilla world, useful tips and tricks based on user activity, and more.
To achieve personalized, activity-based messaging in a privacy-respecting and efficient manner, the service creates a Bundle of Snippets per locale. Bundles are HTML documents that contain all Snippets targeted to a group of users, including their style sheets, images, metadata, and the JS decision engine. The Bundle is transferred to the client, where the locally executed decision engine selects a snippet to display. A carefully designed system with multiple levels of caching takes care of the delivery; one of those layers is a CDN.
Over the last few months we observed a significant uptick in our CDN costs as Mozilla’s Lifecycle Marketing Team increased the number of Snippets for the English language from about 10 to 150. The Bundle file size grew from about 200 KiB to more than 4 MiB. Given that Firefox requests new Bundles every 4 hours, that translated to about 75 TB of transferred data per day, or about 2.25 PB (yes, that’s Petabytes!) per month, despite the local browser caching.
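For a rough sense of scale, here’s a back-of-the-envelope calculation using only the figures above. The download count it implies is an estimate, not a measured number, since local caching means not every request transfers the full Bundle:

```python
# Rough scale check based on the figures quoted above.
bundle_mib = 4          # Bundle size after the Snippet count grew to ~150
daily_tb = 75           # observed CDN transfer per day

bundle_bytes = bundle_mib * 1024 ** 2
daily_bytes = daily_tb * 10 ** 12

downloads_per_day = daily_bytes / bundle_bytes
print(f"~{downloads_per_day / 1e6:.0f} million Bundle downloads per day")  # ~18 million

monthly_pb = daily_tb * 30 / 1000
print(f"~{monthly_pb:.2f} PB per month")  # ~2.25 PB, matching the figure above
```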
Bundles include everything a Snippet needs to be displayed: the targeting rules, the text, and the image in base64-encoded format. Our first hypothesis was that we could reduce the Bundle size by reducing the image size. We ran optipng against all images in the Bundle to test the hypothesis. The images were optimized, but the Bundle shrank by only 100 KiB, about 2.5% of the total size.
Our second hypothesis was to replace the images with links to images. Since not all Snippets are displayed to all users, we benefit by not transferring all images to all users. This reduced the Bundle size to 1.1 MiB, not counting the images that still have to be transferred separately.
Our third hypothesis was to replace GZip with Brotli compression. Brotli is a modern compression algorithm supported by Firefox and all other major browsers as an alternative content encoding for HTTP. Brotli reduced the size of the Bundle down to 500 KiB, about 25% of the 2.2 MiB produced by CloudFront’s on-the-fly GZip compression. Since CloudFront does not support on-the-fly Brotli compression, we prepare and compress the Bundles at the app level before uploading them to S3. With the correct Content-Encoding headers set, the S3 objects are ready to be served by the CDN.
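A minimal sketch of that pre-compression step, assuming a boto3 client and a hypothetical bucket name and cache policy (the actual service code differs):

```python
import boto3
import brotli

s3 = boto3.client("s3")

def upload_bundle(html: str, key: str, bucket: str = "snippets-bundles") -> None:
    """Brotli-compress a Bundle and upload it ready to serve.

    The bucket name, key layout, and Cache-Control value are placeholders;
    the important part is the Content-Encoding header, which lets the CDN
    pass the pre-compressed object through untouched.
    """
    body = brotli.compress(html.encode("utf-8"), quality=11)
    s3.put_object(
        Bucket=bucket,
        Key=key,
        Body=body,
        ContentType="text/html; charset=utf-8",
        ContentEncoding="br",                   # body is Brotli-compressed
        CacheControl="public, max-age=14400",   # e.g. 4 hours, matching the fetch interval
    )
```

One caveat: browsers only advertise Brotli support (Accept-Encoding: br) over HTTPS, so serving pre-compressed objects this way assumes all Bundle traffic is delivered over TLS.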
Although all three solutions can reduce the Bundle size, the third provided the best performance-to-effort ratio, so we proceeded with its implementation. The next day’s reports showed a significant drop in costs, marking the project a success: from the original average of 75 TB of transferred data per day, we dropped down to 15 TB. We plan to improve further in the future by moving the images outside the Bundle.
It’s clear that Brotli compression can achieve significantly higher compression ratios than GZip at the expense of more CPU time. Even though our CDN of choice doesn’t support Brotli, assets can be pre-compressed and uploaded ready to serve.
On its face, www.mozilla.org doesn’t look like it would be a complex application to write, maintain, or run. But when you throw over 100 million unique visitors per week at any site, things can get complicated quickly. Add translations of the content into over 100 languages and you start to get an idea of where it might get interesting. So we take every opportunity we can to simplify and to reduce hosting complexity and cost. This is where the idea to switch to SQLite for our database needs in production was born.
The traditional answer to the question “should we use SQLite for our web application in production?” is an emphatic NO. But, again, bedrock (the Django application behind www.mozilla.org) is different: it uses its database as a read-only data store as far as the web application is concerned. We run a single data-updater process (per cluster) that writes the updates to the DB server that all of the app instances use. Most of bedrock is static content coded directly into templates, but we use the database to store things like product release notes, security advisories, blog posts, Twitter feeds, and the like; basically anything that needs updating more often than we deploy the site. SQLite is indeed a bad fit for a typical web application that writes and reads data as part of its normal function, because SQLite rightly locks itself to a single writer at a time, and a web app with any traffic almost certainly needs to write more than one thing at a time. But when you only need to read data, SQLite is an incredibly fast and robust storage solution.
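Concretely, the read-only use boils down to pointing Django at a SQLite file on local disk, something along these lines (the path is a placeholder, not bedrock’s actual setting):

```python
# settings.py (sketch)
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.sqlite3",
        # The file that the updater process downloads and atomically swaps in
        # (see the refresh sketch below).
        "NAME": "/app/data/bedrock.db",
    }
}
```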
The trick with a SQLite store is refreshing the data. We do still need to update all those bits of data I mentioned before. Our
solution to this is to keep the aforementioned single process updating the data, but this time it will update a local SQLite file,
calculate a hash of said file, and upload the database and its metadata (a JSON file that includes the SHA256 hash) to AWS S3.
The Docker containers for the web app will also have a separate process running that will check for a new database file on a schedule
(every 5 min or so), compare its metadata to the one currently in use, download the newer database, check its hash against the one from
the metadata to ensure a successful download, and swap it with the old file atomically, with the web app none the wiser. Using Python’s
os.rename function to swap the database file ensures an atomic switch with zero errors due to a missing DB file. We thought about using
symlinks for this but it turns out to be harder to re-point a symlink than to just do the rename which atomically overwrites the old file
with the new (I’m pretty sure it’s actually just updating the inode to which the name points but I’ve not verified that).
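A simplified sketch of that refresher loop, with hypothetical URLs, paths, and metadata keys (bedrock’s real implementation lives in its repo and differs in detail):

```python
import hashlib
import json
import os
import time
import urllib.request

# Hypothetical locations; bedrock's real paths and metadata layout differ.
DB_URL = "https://bedrock-db.example.com/bedrock.sqlite"
META_URL = DB_URL + ".json"
LIVE_DB = "/app/data/bedrock.db"

def file_sha256(path: str) -> str:
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def refresh_db() -> None:
    meta = json.load(urllib.request.urlopen(META_URL))
    if os.path.exists(LIVE_DB) and file_sha256(LIVE_DB) == meta["sha256"]:
        return  # already up to date

    tmp_path = LIVE_DB + ".new"
    urllib.request.urlretrieve(DB_URL, tmp_path)
    if file_sha256(tmp_path) != meta["sha256"]:
        os.remove(tmp_path)  # bad download; try again on the next run
        return

    # os.rename is atomic on the same filesystem, so the web app always sees
    # either the old file or the new one, never a missing database.
    os.rename(tmp_path, LIVE_DB)

while True:  # in production this runs on a schedule, roughly every 5 minutes
    refresh_db()
    time.sleep(300)
```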
When all of this is working it means that bedrock no longer requires a database server. We can turn off our AWS RDS instances and never have to worry about DB server maintenance or downtime. The site isn’t all that much faster, since as I said it’s mostly spending its time rendering Jinja templates, but it is a lot cheaper to run and less likely to go down. We are also making DB schema changes easier and less error-prone, since the DB filenames include the git hash of the version of bedrock that created them. This means that the production Docker images contain an updated and migrated database file, and an instance will only download an update once the same version of the site is the one producing database files.
And production advantages aren’t the only win: we also have a much simpler development bootstrap process now, since getting all of the data you need to run the full site is a simple matter of either running bin/run-db-download.py or pulling the prod Docker image (mozorg/bedrock:latest), which contains a decently up-to-date database and the machinery to keep it updated. None of this requires AWS credentials since the database is publicly available.
Along with actually performing the updates in every running instance of the site, we also need to be able to monitor that said updates are actually happening. To this end we created a page on the site that gives us some data on the last time that instance ran the update, the git hash of bedrock that is currently running, the git hash used to create the database in use, and how long ago said database was updated. This page will also respond with a 500 code instead of the normal 200 if the DB and L10n update tasks happened too long ago. At the time of writing the updates happen every 5 minutes, and the page would start to fail at 10 minutes of no updates. Since the updates and the site are running in separate processes in the Docker container, we need a way for the cron process to communicate to the web server the time of the last run for these tasks. For this we decided on files in /tmp that the cron jobs simply touch, and whose mtime the web server can read (check out the source code for details).
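A minimal sketch of that heartbeat check, with hypothetical file names and thresholds (the real view reports more detail):

```python
import os
import time

from django.http import JsonResponse

# Hypothetical heartbeat files; the cron jobs touch these after each successful run.
HEARTBEATS = {
    "db_update": "/tmp/db-last-updated",
    "l10n_update": "/tmp/l10n-last-updated",
}
MAX_AGE_SECONDS = 10 * 60  # updates run every 5 minutes; fail after 10 minutes of silence

def update_health(request):
    now = time.time()
    ages = {}
    healthy = True
    for name, path in HEARTBEATS.items():
        try:
            age = now - os.path.getmtime(path)
        except OSError:  # file missing: the task has never run in this container
            age = None
            healthy = False
        else:
            healthy = healthy and age < MAX_AGE_SECONDS
        ages[name] = age
    return JsonResponse({"ages_seconds": ages}, status=200 if healthy else 500)
```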
To actually monitor this view we are starting with simply using New Relic Synthetics pings of this URL at each of our clusters
(currently oregon-b, tokyo, and frankfurt). This is a bit suboptimal because it will only be checking whichever pod happens to respond
to that particular request. In the near future our plan is to move to creating another process type for bedrock
that will query Kubernetes for all of the running pods in the cluster and ping each of them on a schedule. We’ll then ping
Dead Man’s Snitch (DMS) on every fully successful round of checks, and if they fail more than a
couple of times in a cluster we’ll be notified. This will mean that bedrock will be able to monitor itself for data update troubles.
We also ping DMS on every database update run, so we should know quickly if either database uploading or downloading is having trouble.
We obviously don’t yet know the long-term effects and consequences of this change (as of writing it’s been in production less than a day),
but for now our operational complexity and costs are lower. I feel confident calling it a win for our deployment reliability for now.
Bedrock may eventually move toward having a large part of it pre-generated and hosted statically, but for now this version feels like the
one that will be as robust, resilient, and reliable as possible while still being one big Django web application.
Here’s what happened in February to the code, data, and tools that support MDN Web Docs, and here’s the plan for March:
Done in February
In February, we asked the MDN community to help convert compatibility data to the browser-compat-data repository. Florian Scholz led this effort, starting with a conference talk and blog post last month. He suggested migration tasks, and added a call to action on the old pages:
The response from the community has been overwhelming. There were 203 PRs
merged in February, and 96 were from 23 first-time contributors. Existing
contributors such as Mark Boas,
Chris Mills, and
wbamberg kept up their January pace.
The PRs were reviewed both for the correctness of the conversion and to ensure the data was up to date; Florian, Jean-Yves Perrier, and Joe Medley did the most reviews.
In February, the project jumped from 43% to 57% of the data converted,
and the data is better than ever.
There are two new tools using the data. SphinxKnight is working on compat-tester, which scans documents for compatibility issues against a given set of browsers. K3N is working on mdncomp, which displays compatibility data on the command line:
If you have a project using the data, let us know about it!
We continue to improve and expand the interactive examples, such as a
demo from Rachel Andrew:
We’re expanding the framework to allow for HTML examples, which often need a mix of HTML and CSS to be interesting. Like previous efforts, we’re using demos to develop this feature. We show the work-in-progress to an individual developer, watch how the demo is used, ask for feedback, and then iterate on the design and implementation.
The demos have gone well, and the team will firm up the implementation and
write more examples to prepare for production. The team will also work on
expanding test coverage and formalizing the build tools in a new package.
We made many changes last month to improve the performance and reliability of MDN. They worked, and we’ve entered a new period of calm. We’ve had a month without 3 AM downtime or performance alerts, for the first time since the move to AWS. The site is responding more smoothly, and easily handling MDN’s traffic.
This has freed us to focus on longer term fixes and on the goals for the
quarter. One of those is to serve MDN from behind a CDN, which will further
reduce server load and may have a huge impact on response time.
Ryan Johnson is getting the code ready. He switched to one of Django’s built-in middlewares (PR 4647), which allowed him to remove some buggy caching code (PR 4648). Ryan is now working through the many endpoints, adding caching headers and cleaning up tests (PR 4677, and others). Once this work is done, we’ll add the CDN, which will cache content based on the directives in the headers.
My focus has been on the Django 1.11 upgrade, since Django 1.8 is scheduled to lose support in April. This requires updating third-party libraries (PR 4660, and another from Safwan Rahman), and we’re dropping other requirements entirely (PR 4669). We’ve taken care of the most obvious upgrades, but there are 142,000 lines of Python in our libraries, so we expect more surprises as we get closer to the switch.
Once the libraries are compatible with Django 1.11, the remaining issues will be with the Kuma codebase. Some changes are small and easy, such as a one-liner in PR 4684. Some will be quite large: our code that serves up locale-specific content is incompatible, and we’ll have to swap out some of our own code.
There were 413 PRs merged in February:
147 of these were from first-time contributors:
- Update String method support in Node.js
(BCD PR 938),
- Add Edge support of
(BCD PR 939),
animation-name is supported since Edge 12
(BCD PR 951),
- Add RTCCertificate compat data
Adding compat data for RTCConfiguration
48 more PRs
to BCD from
- Change ordering for String.prototype.includes
(BCD PR 953),
- Added chrome and opera support of min-height:fill-available
(BCD PR 962),
Abel Serrano Juste.
- String.prototype.includes is incorrectly marked as deprecated
(BCD PR 974),
- Add compat data for Animation
Adding compat data for Blob
6 more PRs
to BCD from
- Adds compat data for AnimationEffectTiming
Adds compat data for AnimationEffectTimingReadOnly
6 more PRs
to BCD from
Array.prototype.values() shipped in FF and Chrome
(BCD PR 1014),
- Add comp data for BroadcastChannel
add compat data for BudgetService and BudgetState
6 more PRs
to BCD from
- Update referrer policy compat data to note some values as standard
(first contribution to BCD).
- Adding compat data for HTML global attributes
(BCD PR 1089),
- Add Animation.updatePlaybackRate
(BCD PR 1106),
- Add PerformanceNavigationTiming
to BCD from
- Add nodejs compat for Object.entries()
Add nodejs compat for Object.getOwnPropertyDescriptors()
to BCD from
- Correct support for css
(BCD PR 1133),
- Update npm dependencies install command
(BCD PR 1136),
- Add compat data for Console
(BCD PR 1145),
- Added Edge version that supports exponentiation
(BCD PR 1153),
Vijay Koushik, S..
- Updated versions for node async/await
(BCD PR 1182),
- HTMLHtmlElement compat data
(BCD PR 1201),
- adding compat data for HTMLHRElement
(BCD PR 1202),
- Adding flex-wrap example
8 more PRs
to Interactive Examples from
- Update class expression example
(Interactive Examples PR 539),
- Fix typo
(Interactive Examples PR 540),
Hidde de Vries.
- Adding border radius example with recommended changes.
Adding list-style css example.
Adding supporting samples for border corners to support request #502
to Interactive Examples from
list-style-position css example
(Interactive Examples PR 559),
- add css font-size examples
(Interactive Examples PR 567),
- CSS examples: example for z-index
(Interactive Examples PR 570),
- CSS interactive examples: Add example for width.
(Interactive Examples PR 571),
- Add resize example
(Interactive Examples PR 582),
- Add example for list-style-type CSS property
Add example for list-style-image CSS property
9 more PRs
Daniel D. Beck
(first contributions to Interactive Examples).
- feat: adds cursor interactive examples
(Interactive Examples PR 597),
- Issue 561 Added example for vertical-align in text context
(Interactive Examples PR 606),
- Browser Kompatibilität -> Browserkompatibilität
add german translation
add german translation
to KumaScript from
- Add WebAuthn spec
Add Web Authentication API
to KumaScript from
- Update Russian translations
(KumaScript PR 578),
- Add Arabic language to
(KumaScript PR 586),
- Update Japanese translation
Added Japanese translation
Added Japanese translations for
to KumaScript from
- Modify Japanese translation of spec status
(KumaScript PR 596),
- Adding missing fr strings
(KumaScript PR 604),
- added Chinese simplified translation for
(KumaScript PR 637),
- fix bug 1419466 - added Jinja2 extension of translating the 404 page
(Kuma PR 4655),
- bug 951180: Position labels after checkboxes
(Kuma PR 4682),
- Microsoft CSS changes
(Data PR 156),
- Update align-items to match current spec
(first contribution to Data).
- Correct and add some Japanese translations
(Data PR 175),
Other significant PRs:
Planned for March
We’ll continue with the compatibility migration, interactive examples, the CDN,
and the Django 1.11 migration in March.
Starting March 2, the MDN developers move from Marketing to Emerging Technologies. We’ll be working on the details of this transition in March and the coming months. That will include planning an infrastructure transition and finding a new home for the MDN Changelog.
Stephanie Hobson and I joined Marketing
Engineering and Operations in March 2016, back when it was still Engagement
Engineering. EE was already responsible for
50% of Mozilla’s web traffic
with www.mozilla.org, and adding
support.mozilla.org (34%) and
developer.mozilla.org (16%) put
99% of Mozilla’s web presence under one engineering group. MDN
benefited from this amazing team in many ways:
- Josh Mize led the effort to integrate MDN into
the marketing technology and processes. He helped with our move to
Docker-based development and deployment, implemented
demo deploys, advocated for
a read-only and statically-generated deployment, and worked out details of
the go-to-AWS strategy, such as file serving and the master database
transfer. Josh keeps up to date on the infrastructure community, and knows
what tech is reliable, what the community is excited about, and what the
next best practices will be.
- Dave Parfitt did a lot of the heavy lifting on the AWS transition, from demo instances, through maintenance mode and staging deployments, all the way to a smooth production deployment. He figured out database initialization, implemented the redirects, and explored the dark corners of unicode filenames. He consistently does what needs to be done, then goes above and beyond by refining the process, writing excellent documentation, and automating the manual steps.
- Jon Petto introduced and integrated tooling that allows us to experiment with in-content changes in a lightweight, secure way.
- Giorgos Logiotatidis’s
Jenkins scripts and workflows
are the foundation of
MDN’s Jenkins integration,
used to automate our tests and AWS deployments.
- Paul McLanahan helped review PRs when we had a single backend developer. His experience migrating bedrock to AWS was invaluable, and django-redirect-urls made it possible to migrate away from Apache and get 10 years of redirects under control.
- Schalk Neethling reviewed front-end code when we were down to one front-end developer. He implemented the interactive examples from prototype to production, and joined the MDN team when his previous project transitioned to bedrock.
- Ben Sternthal made the transition into Marketing possible. He made us feel welcome from day one, hired some amazing contractors to help with the dark days of the 2016 spam attack, hired Ryan Johnson, and worked to secure the resources and support to move to AWS. He created a space where developers could talk about what is important to us, where we spent time and effort on technical improvements and career advancement, and where technical excellence was balanced with features and experiments.
MDN is on a firmer foundation after its time in MozMEAO, and is ready for the next chapter in its 13-year history.
Ryan Johnson, Schalk Neethling, and I will join the Advanced Development team
in Emerging Technologies, reporting to
Faramarz Rashed. The Advanced Development team
has been working on various ET projects, most recently
an Internet of Things (IoT) project that is focused on decentralization,
security, privacy, and interoperability. It’s a team that is focused on
getting fresh technology into users’ hands. This is a great environment for
the next phase of MDN, as we build on the more stable foundation and expand our reach.
We’re traveling to the Mozilla Paris Office
in March. We’ll have team meetings on Tuesday, March 13 through Thursday,
March 15, to plan for the next three months and beyond.
From Friday, March 16 through Sunday, March 18, we’ll have the third Hack on MDN event. The last one was in 2015 in Berlin, and the team is excited to return to this format. The focus of the Paris event will be the Browser Compat Data project. We expect to build some tools using the data, create alternative displays of compat information, and improve the migration and review processes.
One of our goals for the year is to improve page load times on MDN. We’re building on a similar SEO project from last year, and looking for an external expert to measure MDN’s performance and recommend next steps. Take a look at the Request for Proposal; we plan to select the top bidders by March 30, 2018.
Here’s what happened on the MozMEAO SRE team from February 16 - February 28th.
Most of our recent efforts have been related to the SUMO migration to AWS. We’ll be running the stage and production environments in our Oregon-A and Oregon-B clusters, with read-only failover in Frankfurt.
As you may have seen in several of our previous posts, we’re moving all of our webapp hosting from Deis to Kubernetes (k8s). As part of that move, we’ve also been doing some additional thinking about the security of our deployments. One thing we’ve not done as good a job with as we should is Django’s ALLOWED_HOSTS setting. We should have been adding all possible hosts to that list, but it seems we used to occasionally leave it set to ['*']. This isn’t great, but it also isn’t the end of the world, since we don’t knowingly construct URLs using the info sent via the Host header. In an effort to cover all the bases we’ve decided to improve this. Unfortunately, our particular combination of technologies doesn’t make this as easy as we thought it would be (story of our lives).
AWS ELB Health Checks
Here’s the thing: Amazon Web Services’ (AWS) Elastic Load Balancers (ELB) do not have many configuration options for their health checks. These checks ensure that your app on a particular node in your cluster is working as expected. If the check fails, the ELB will remove the node from the list of nodes to which it will route requests for your app. However, because it’s hitting the nodes directly, it doesn’t rely on DNS; it requests the IP address and port directly, and it doesn’t allow you to specify custom headers (e.g. the Host header). It also can’t do HTTPS, because we terminate TLS connections at the ELB, so the app nodes speak only plain HTTP back to the ELB. All of that means that our health-check endpoint needs to do two unique things: allow HTTP connections and allow the IP address that the ELB requests as a valid Host header. The first bit is easy enough
when using Django’s built-in SecurityMiddleware, since it supports the SECURE_REDIRECT_EXEMPT setting. It’s the second requirement that gets more interesting when combined with k8s.
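For the first requirement, a plausible settings fragment looks something like this (the health-check path here is a placeholder; exempting whatever URL the ELB actually probes is the point):

```python
# settings.py (sketch)
SECURE_SSL_REDIRECT = True    # SecurityMiddleware redirects plain HTTP to HTTPS...
SECURE_REDIRECT_EXEMPT = [
    r"^healthz/?$",           # ...except for the ELB health-check path, which
]                             # must keep answering over plain HTTP.
```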
The way I understand it (and I’m admittedly no expert), k8s (at least the way we use it) sets up a NodePort per app (or namespace). To hit that app you can hit any node in the cluster at that port, and that node will route you to one of the nodes running a pod for that app. The important bit for us is that the node that serves the request is not necessarily the one the ELB sent it to. So the Host header may contain an IP address for the node that was initially hit, but not necessarily for the node that serves the request. This means we can’t simply add the IP of the host to the ALLOWED_HOSTS list when the app starts. We could get more info from AWS’ metadata service endpoint, but for security reasons we block that service from all of our containers.
So the approach could then be to simply add the IPs of all the nodes in the cluster to the ALLOWED_HOSTS setting and call it done. The problem with this happens when there is a scaling event. When a node is killed and a new one started, or the cluster is scaled to include more nodes, you’d need a way to inform every running pod of this change so they could get the new list of IPs. If they didn’t update the list, the new node(s) could be immediately excluded from the cluster, because health checks would return 400s since their IP (host) would not be allowed by Django.
The way we decided to solve this was by implementing a Django middleware that allows a range of IP addresses defined by a CIDR (Classless Inter-Domain Routing) block. We’ve released this middleware in a Django package called django-allow-cidr. It works by storing the normal hosts you’ve set in the ALLOWED_HOSTS setting, changing that setting to ['*'] in order to bypass Django’s default Host header checking in the HttpRequest.get_host() method, and doing the checking itself. It does this checking via the same methods Django would have used, but if those fail it does a secondary check using the IP ranges you’ve defined in an ALLOWED_CIDR_NETS setting. It creates netaddr.IPNetwork instances from the CIDRs in that list, and checks against them any host that isn’t valid according to your original ALLOWED_HOSTS setting. Failing both of those checks results in an immediate 400 response.
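Here’s a stripped-down illustration of that idea; it is not the package’s actual code (which handles more edge cases), and the ORIGINAL_ALLOWED_HOSTS name exists only for this sketch:

```python
from django.conf import settings
from django.http import HttpResponseBadRequest
from netaddr import AddrFormatError, IPAddress, IPNetwork

class AllowCIDRMiddleware:
    def __init__(self, get_response):
        self.get_response = get_response
        # The hosts the operator actually trusts, plus CIDR blocks covering the nodes.
        self.allowed_hosts = getattr(settings, "ORIGINAL_ALLOWED_HOSTS", [])
        self.networks = [IPNetwork(net) for net in getattr(settings, "ALLOWED_CIDR_NETS", [])]

    def __call__(self, request):
        # get_host() succeeds because ALLOWED_HOSTS has been swapped to ["*"].
        hostname = request.get_host().rsplit(":", 1)[0]  # naive port strip; IPv6 not handled
        if not self._host_allowed(hostname):
            return HttpResponseBadRequest("Invalid HTTP_HOST header")
        return self.get_response(request)

    def _host_allowed(self, hostname):
        # First the explicit host list, much as Django itself would check...
        if hostname in self.allowed_hosts:
            return True
        # ...then fall back to the CIDR ranges for bare IP addresses.
        try:
            ip = IPAddress(hostname)
        except AddrFormatError:
            return False
        return any(ip in net for net in self.networks)
```

With the real package you keep your hostnames in ALLOWED_HOSTS as usual and add something like ALLOWED_CIDR_NETS = ['10.0.0.0/16'] for the node network; the middleware reads ALLOWED_HOSTS itself and performs the ['*'] swap described above.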
That was a long way to go to get to some simple health checking, but we believe it was the right move for the reliability and security of our Django apps hosted in our k8s infrastructure on AWS. Please check out the repo for django-allow-cidr on GitHub if you’re interested in the code. Our hope is that releasing this as a general-use package will help others who find themselves in our situation, as well as helping us do less copypasta coding across our various web apps.