Bedrock: The SQLitening

On its face, www.mozilla.org doesn’t look like it would be a complex application to write, maintain, or run. But throwing over 100 million unique visitors per week at any site complicates things quickly. Add translations of the content into over 100 languages and you can start to see where it might get interesting. So we take every opportunity we can get to reduce hosting complexity and cost. That is where the idea of switching to SQLite for our production database needs was born.

The traditional answer to the question “should we use SQLite for our web application in production?” is an emphatic NO. But, again, bedrock is different. As far as the web application is concerned, its database is a read-only data store. We run a single data updater process (per cluster) that writes updates to the DB server that all of the app instances use. Most of bedrock is static content coded directly into templates, but we use the database to store things like product release notes, security advisories, blog posts, Twitter feeds, and the like; basically anything that needs updating more often than we deploy the site. SQLite is indeed a bad fit for a typical web application that both writes and reads data during normal operation, because SQLite rightly allows only a single writer at a time, and a web app with any traffic almost certainly needs to write more than one thing at a time. But when you only need to read data, SQLite is an incredibly fast and robust solution for data storage.
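To make the read-only idea concrete, here is a minimal sketch using Python’s standard sqlite3 module. It is not bedrock’s actual code (bedrock goes through Django’s database layer); the table name, the file name, and the choice to open the file with SQLite’s mode=ro URI parameter are all assumptions made for illustration.

```python
import os
import sqlite3
import tempfile
from pathlib import Path


def open_read_only(path):
    """Open an existing SQLite file so that any write attempt fails."""
    # mode=ro is standard SQLite URI syntax; uri=True tells sqlite3 to parse it.
    uri = Path(path).resolve().as_uri() + "?mode=ro"
    return sqlite3.connect(uri, uri=True)


# Build a tiny throwaway database (a stand-in for the real data file).
db_path = os.path.join(tempfile.mkdtemp(), "demo.db")
writer = sqlite3.connect(db_path)
writer.execute("CREATE TABLE notes (title TEXT)")
writer.execute("INSERT INTO notes VALUES ('Firefox release notes')")
writer.commit()
writer.close()

# The web app side: reads work, writes are refused.
reader = open_read_only(db_path)
titles = [row[0] for row in reader.execute("SELECT title FROM notes")]

try:
    reader.execute("INSERT INTO notes VALUES ('should not be stored')")
    write_failed = False
except sqlite3.OperationalError:
    write_failed = True
```

Opening the file read-only is optional when the app never issues writes, but it turns an accidental write into an immediate error.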

Data Updates

The trick with a SQLite store is refreshing the data. We do still need to update all those bits of data I mentioned before. Our solution is to keep the aforementioned single process updating the data, but now it updates a local SQLite file, calculates a hash of that file, and uploads the database and its metadata (a JSON file that includes the SHA256 hash) to AWS S3. The Docker containers for the web app also run a separate process that checks for a new database file on a schedule (every 5 minutes or so). It compares the published metadata to that of the database currently in use, downloads the newer database, checks its hash against the one from the metadata to ensure a successful download, and swaps it with the old file atomically, with the web app none the wiser. Using Python’s os.rename function to swap the database file ensures an atomic switch with zero errors due to a missing DB file. We thought about using symlinks for this, but it turns out to be harder to re-point a symlink than to just do the rename, which atomically overwrites the old file with the new (I’m pretty sure it’s actually just updating the inode to which the name points but I’ve not verified that).
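The verify-then-swap step can be sketched roughly as follows. This is an illustration rather than bedrock’s real updater: the function names, file names, and metadata layout are assumptions, the S3 download is replaced by a local file, and the sketch uses os.replace, the portable sibling of os.rename that always overwrites the destination.

```python
import hashlib
import os
import tempfile


def sha256_of(path):
    """Hash a file in chunks so a large database never sits fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def swap_in_new_db(downloaded_path, expected_sha256, live_path):
    """Verify the download, then atomically replace the live file."""
    if sha256_of(downloaded_path) != expected_sha256:
        os.remove(downloaded_path)  # discard the corrupt or partial download
        return False
    # Atomic as long as both paths are on the same filesystem.
    os.replace(downloaded_path, live_path)
    return True


# Demo with local files standing in for the S3 objects.
workdir = tempfile.mkdtemp()
live = os.path.join(workdir, "bedrock.db")
with open(live, "wb") as f:
    f.write(b"old data")

metadata = {"sha256": hashlib.sha256(b"new data").hexdigest()}

good_download = os.path.join(workdir, "bedrock.db.download")
with open(good_download, "wb") as f:
    f.write(b"new data")
good = swap_in_new_db(good_download, metadata["sha256"], live)

bad_download = os.path.join(workdir, "bedrock.db.bad")
with open(bad_download, "wb") as f:
    f.write(b"new da")  # simulated truncated download
bad = swap_in_new_db(bad_download, metadata["sha256"], live)

with open(live, "rb") as f:
    live_contents = f.read()
```

On POSIX systems the rename changes which file the name refers to, so a process that already has the old file open keeps reading the old data until it reopens the path, and no reader ever sees a missing or half-written file.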

When all of this is working it means that bedrock no longer requires a database server. We can turn off our AWS RDS instances and never have to worry about DB server maintenance or downtime. The site isn’t all that much faster since, like I said, it’s mostly spending time rendering Jinja templates, but it is a lot cheaper to run and less likely to go down. We are also making DB schema changes easier and less error-prone, since each DB filename includes the git hash of the version of bedrock that created it. This means that the production Docker images contain an updated and migrated database file, and an instance will only download an update once the same version of the site is the one producing database files.

And production advantages aren’t the only win: we also have a much simpler development bootstrap process now. Getting all of the data you need to run the full site is a simple matter of either running bin/run-db-download.py or pulling the prod Docker image (mozorg/bedrock:latest), which contains a decently up-to-date database and the machinery to keep it updated. No AWS credentials are required, since the database is publicly available.

Verifying Updates

Along with actually performing the updates in every running instance of the site, we also need to be able to monitor that said updates are actually happening. To this end we created a page on the site that tells us when that instance last ran the update, the git hash of bedrock that is currently running, the git hash used to create the database in use, and how long ago said database was updated. This page will also respond with a 500 code instead of the normal 200 if the DB and L10n update tasks happened too long ago. At the time of writing the updates happen every 5 minutes, and the page starts to fail after 10 minutes without an update. Since the updates and the site are running in separate processes in the Docker container, we need a way for the cron process to communicate to the web server the time of the last run for these tasks. For this we decided on files in /tmp that the cron jobs simply touch, and whose mtime the web server can read (check out the source code for details).
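The touch-file handshake between the cron process and the web server might look something like the sketch below. The marker file names and function names are hypothetical and the real view lives in bedrock’s source; only the 10-minute threshold and the 200/500 behavior come from the description above.

```python
import os
import tempfile
import time

MAX_AGE_SECONDS = 10 * 60  # the page starts failing after 10 minutes


def touch(path):
    """What a cron job does after a successful run: bump the file's mtime."""
    with open(path, "a"):
        os.utime(path, None)


def seconds_since(path, now=None):
    """Age of a marker file in seconds, or None if the task has never run."""
    try:
        mtime = os.path.getmtime(path)
    except OSError:
        return None
    return (time.time() if now is None else now) - mtime


def health_status(marker_paths, now=None):
    """Return 200 if every task ran recently enough, otherwise 500."""
    for path in marker_paths:
        age = seconds_since(path, now)
        if age is None or age > MAX_AGE_SECONDS:
            return 500
    return 200


# Demo: a temp directory stands in for /tmp; marker names are made up.
marker_dir = tempfile.mkdtemp()
db_marker = os.path.join(marker_dir, "last-run-db-update")
l10n_marker = os.path.join(marker_dir, "last-run-l10n-update")
touch(db_marker)
touch(l10n_marker)

fresh = health_status([db_marker, l10n_marker])
stale = health_status([db_marker, l10n_marker], now=time.time() + 11 * 60)
never_ran = health_status([os.path.join(marker_dir, "does-not-exist")])
```

Passing the current time in as a parameter keeps the staleness logic testable without waiting ten minutes.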

To actually monitor this view we are starting with simply using New Relic Synthetics pings of this URL at each of our clusters (currently oregon-b, tokyo, and frankfurt). This is a bit suboptimal because it will only be checking whichever pod happens to respond to that particular request. In the near future our plan is to move to creating another process type for bedrock that will query Kubernetes for all of the running pods in the cluster and ping each of them on a schedule. We’ll then ping Dead Man’s Snitch (DMS) on every fully successful round of checks, and if they fail more than a couple of times in a cluster we’ll be notified. This will mean that bedrock will be able to monitor itself for data update troubles. We also ping DMS on every database update run, so we should know quickly if either database uploading or downloading is having trouble.

Conclusions

We obviously don’t yet know the long-term effects and consequences of this change (as of writing it’s been in production less than a day), but for now our operational complexity and costs are lower. I feel confident calling it a win for our deployment reliability for now. Bedrock may eventually move toward having a large part of it pre-generated and hosted statically, but for now this version feels like the one that will be as robust, resilient, and reliable as possible while still being one big Django web application.

MDN Changelog for February 2018

Here’s what happened in February to the code, data, and tools that support MDN Web Docs, followed by the plan for March.

Done in February

Migrated 14% of compatibility data

In February, we asked the MDN community to help convert compatibility data to the browser-compat-data repository. Florian Scholz led this effort, starting with a conference talk and blog post last month. He created GitHub issues to suggest migration tasks, and added a call to action on the old pages:

[Image: call-to-action]

The response from the community has been overwhelming. There were 203 PRs merged in February, and 96 were from 23 first-time contributors. Existing contributors such as Mark Boas, Chris Mills, and wbamberg kept up their January pace. The PRs were reviewed for the correctness of the conversion as well as ensuring the data was up to date, and Florian, Jean-Yves Perrier, and Joe Medley have done the most reviews. In February, the project jumped from 43% to 57% of the data converted, and the data is better than ever.

There are two new tools using the data. SphinxKnight is working on compat-tester, which scans an HTML, CSS, or JavaScript file for compatibility issues with a user-defined set of browsers. K3N is working on mdncomp, which displays compatibility data on the command line:

[Image: mdncomp]

If you have a project using the data, let us know about it!

Improved and Extended Interactive Examples

We continue to improve and expand the interactive examples, such as a clip-path demo from Rachel Andrew:

[Image: clip-path]

We’re expanding the framework to allow for HTML examples, which often need a mix of HTML and CSS to be interesting. Like previous efforts, we’re using user testing to develop this feature. We show the work-in-progress, like the <table> demo, to an individual developer, watch how the demo is used and ask for feedback, and then iterate on the design and implementation.

[Image: html-table]

The demos have gone well, and the team will firm up the implementation and write more examples to prepare for production. The team will also work on expanding test coverage and formalizing the build tools in a new package.

Prepared for a CDN and Django 1.11

We made many changes last month to improve the performance and reliability of MDN. They worked, and we’ve entered a new period of calm. We’ve had a month without 3 AM downtime or performance alerts, for the first time since the move to AWS. The site is responding more smoothly, and easily handling MDN’s traffic.

[Image: new-relic-calm]

This has freed us to focus on longer term fixes and on the goals for the quarter. One of those is to serve MDN from behind a CDN, which will further reduce server load and may have a huge impact on response time. Ryan Johnson is getting the code ready. He switched to Django’s middleware for handling ETag creation (PR 4647), which allowed him to remove some buggy caching code (PR 4648). Ryan is now working through the many endpoints, adding caching headers and cleaning up tests (PR 4676, PR 4677, and others). Once this work is done, we’ll add the CDN that will cache content based on the directives in the headers.

My focus has been on the Django 1.11 upgrade, since Django 1.8 is scheduled to lose support in April. This requires updating third-party libraries like django-tidings (PR 4660) and djangorestframework (PR 4664 from Safwan Rahman). We’re moving away from other requirements, such as dropping dbgettext (PR 4669). We’ve taken care of the most obvious upgrades, but there are 142,000 lines of Python in our libraries, so we expect more surprises as we get closer to the switch.

Once the libraries are compatible with Django 1.11, the remaining issues will be with the Kuma codebase. Some changes are small and easy, such as a one-liner in PR 4684. Some will be quite large. Our code that serves up locale-specific content, such as reverse and LocaleURLMiddleware, is incompatible, and we’ll have to swap some of our oldest code for Django’s version.

Shipped Tweaks and Fixes

There were 413 PRs merged in February, and 147 of these were from first-time contributors.


Planned for March

We’ll continue with the compatibility migration, interactive examples, the CDN, and the Django 1.11 migration in March.

Move developers to Emerging Technologies

Starting March 2, the MDN developers move from Marketing to Emerging Technologies. We’ll be working on the details of this transition in March and the coming months. That will include planning an infrastructure transition, and finding a new home for the MDN Changelog.

Stephanie Hobson and I joined Marketing Engineering and Operations in March 2016, back when it was still Engagement Engineering. EE was already responsible for 50% of Mozilla’s web traffic with www.mozilla.org, and adding support.mozilla.org (34%) and developer.mozilla.org (16%) put 99% of Mozilla’s web presence under one engineering group. MDN benefited from this amazing team in many ways:

  • Josh Mize led the effort to integrate MDN into the marketing technology and processes. He helped with our move to Docker-based development and deployment, implemented demo deploys, advocated for a read-only and statically-generated deployment, and worked out details of the go-to-AWS strategy, such as file serving and the master database transfer. Josh keeps up to date on the infrastructure community, and knows what tech is reliable, what the community is excited about, and what the next best practices will be.
  • Dave Parfitt did a lot of the heavy lifting on the AWS transition, from demo instances, through maintenance mode and staging deployments, and all the way to a smooth production deployment. He figured out database initialization, implemented the redirects, and tackled the dark corners of unicode filenames. He consistently does what needs to be done, then goes above and beyond by refining the process, writing excellent documentation, and automating whenever possible.
  • Jon Petto introduced and integrated Traffic Cop, allowing us to experiment with in-content changes in a lightweight, secure way.
  • Giorgos Logiotatidis’s Jenkins scripts and workflows are the foundation of MDN’s Jenkins integration, used to automate our tests and AWS deployments.
  • Paul McLanahan helped review PRs when we had a single backend developer. His experience migrating bedrock to AWS was invaluable, and his battle-tested django-redirect-urls made it possible to migrate away from Apache and get 10 years of redirects under control.
  • Schalk Neethling reviewed front-end code when we were down to one front-end developer. He implemented the interactive examples from prototype to production, and joined the MDN team when Stephanie Hobson transitioned to bedrock.
  • Ben Sternthal made the transition into Marketing possible. He made us feel welcome from day one, hired some amazing contractors to help with the dark days of the 2016 spam attack, hired Ryan Johnson, and worked for the resources and support to move to AWS. He created a space where developers could talk about what is important to us, where we spent time and effort on technical improvements and career advancement, and where technical excellence was balanced with features and experiments.

MDN is on a firmer foundation after the time spent in MozMEAO, and is ready for the next chapter in its 13-year history.

Ryan Johnson, Schalk Neethling, and I will join the Advanced Development team in Emerging Technologies, reporting to Faramarz Rashed. The Advanced Development team has been working on various ET projects, most recently Project Things, an Internet of Things (IoT) project that is focused on decentralization, security, privacy, and interoperability. It’s a team that is focused on getting fresh technology into users’ hands. This is a great environment for the next phase of MDN, as we build on the more stable foundation and expand our reach.

Meet in Paris for Hack on MDN

We’re traveling to the Mozilla Paris Office in March. We’ll have team meetings on Tuesday, March 13 through Thursday, March 15, to plan for the next three months and beyond.

From Friday, March 16 through Sunday, March 18, we’ll have the third Hack on MDN event. The last one was in 2015 in Berlin, and the team is excited to return to this format. The focus of the Paris event will be the Browser Compat Data project. We expect to build some tools using the data, create alternative displays of compat information, and improve the migration and review processes.

Evaluate Proposals for a Performance Audit

One of our goals for the year is to improve page load times on MDN. We’re building on a similar SEO project from last year, and looking for an external expert to measure MDN’s performance and recommend next steps. Take a look at our Request for Proposal. We plan to select the top bidders by March 30, 2018.

MozMEAO SRE Status Report - February 28, 2018

Here’s what happened on the MozMEAO SRE team from February 16 to February 28.

Current work

support.mozilla.org (SUMO)

Most of our recent efforts have been related to the SUMO migration to AWS. We’ll be running the stage and production environments in our Oregon-A and Oregon-B clusters, with read-only failover in Frankfurt.

Django, K8s, and ELB Health checks

As you may have seen in several of our SRE status reports, we’re moving all of our webapp hosting from Deis to Kubernetes (k8s). As part of that we’ve also been doing some additional thinking about the security of our deployments. One thing we haven’t handled as well as we should have is Django’s ALLOWED_HOSTS setting. We should have been adding all possible hosts to that list, but it seems we used to occasionally leave it set to ['*']. This isn’t great, but it also isn’t the end of the world, since we don’t knowingly construct URLs using the info sent via the Host header. In an effort to cover all bases we’ve decided to improve this. Unfortunately our particular combination of technologies doesn’t make this as easy as we thought it would be (story of our lives).

AWS ELB Health Checks

Here’s the thing: Amazon Web Services’ (AWS) Elastic Load Balancers (ELB) do not have many configuration options for their health checks. These checks ensure that your app on a particular node in your cluster is working as expected. If the check fails, the ELB will remove the node from the list of nodes to which it will route requests for your app. However, because the check hits the nodes directly, it doesn’t rely on DNS: it requests the IP address and port, and it doesn’t allow you to specify custom headers (e.g. the Host header). It also can’t use HTTPS because we terminate TLS connections at the ELB, so the app nodes speak only plain HTTP back to the ELB. All of that means that our health check endpoint needs to do two unusual things: allow HTTP connections and accept the IP address that the ELB requests as a valid Host header. The first bit is easy enough when using Django’s built-in SecurityMiddleware since it supports the SECURE_REDIRECT_EXEMPT setting. It’s the second requirement that gets more interesting when combined with k8s.
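As a rough illustration of the first requirement, the sketch below models how SecurityMiddleware decides whether to redirect a plain HTTP request. SECURE_SSL_REDIRECT and SECURE_REDIRECT_EXEMPT are real Django settings, but the health check path (healthz/) and the helper function are assumptions for illustration, and the function is a simplified stand-in for the middleware rather than Django’s own code.

```python
import re

# Settings as they might appear in settings.py; the path is hypothetical.
SECURE_SSL_REDIRECT = True
SECURE_REDIRECT_EXEMPT = [r"^healthz/$"]


def would_redirect_to_https(path, is_secure):
    """Simplified model of SecurityMiddleware's HTTPS redirect decision."""
    if not SECURE_SSL_REDIRECT or is_secure:
        return False
    # Django strips the leading slash before testing the exempt patterns,
    # so the regexes are written without one.
    stripped = path.lstrip("/")
    return not any(re.search(p, stripped) for p in SECURE_REDIRECT_EXEMPT)


health_check = would_redirect_to_https("/healthz/", is_secure=False)
normal_page = would_redirect_to_https("/en-US/firefox/", is_secure=False)
already_https = would_redirect_to_https("/en-US/firefox/", is_secure=True)
```

With these values the ELB’s plain HTTP health check is served directly, while every other HTTP request is still redirected to HTTPS.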

K8s Routing

The way I understand it (and I’m admittedly no expert) is that k8s (at least the way we use it) sets up a NodePort per app (or namespace). To hit that app you can hit any node in the cluster at that port and that node will route you to one of the nodes that is running a pod for that app. The important bit for us is that the node that serves this request is not necessarily the one that the ELB sent it to. So the Host header may contain an IP address for the node that was initially hit, but not necessarily for the node that serves the request. This means that we can’t simply add the IP of the host to the ALLOWED_HOSTS list when the app starts. We could get more info from AWS’ metadata service endpoint, but for security reasons we block that service from all of our nodes.

So, the approach could then be to simply add all of the IPs for all of the nodes in the cluster to the ALLOWED_HOSTS setting and call it done. The problem with this happens when there is a scaling event. When a node is killed and a new one started, or the cluster is scaled to include more nodes, you’d need to have a way to inform every running pod of this change so they could get the new list of IPs. If they didn’t update the list the new node(s) could be immediately excluded from the cluster because health checks would return 400s since their IP (host) would not be allowed by Django.

Enter django-allow-cidr

The way we decided to solve this was by implementing a Django middleware that allows a range of IP addresses defined in CIDR (Classless Inter-Domain Routing) notation. We’ve released this middleware in a Django package called django-allow-cidr. It works by storing the normal hosts you’ve set in your ALLOWED_HOSTS setting, changing that setting to ['*'] in order to bypass Django’s default Host header checking in the HttpRequest.get_host() method, and doing the checking itself. It performs this check with the same methods Django would have used, but if those fail it does a secondary check using the IP ranges you’ve defined in an ALLOWED_CIDR_NETS setting. It creates netaddr.IPNetwork instances from the CIDRs in that list and checks any host that isn’t valid based on your original ALLOWED_HOSTS setting against them. Failing both of those checks results in an immediate 400 response.
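The two-stage check can be sketched in a few lines. This is not the django-allow-cidr source: it swaps netaddr for the standard library’s ipaddress module so the example is self-contained, it replaces Django’s host validation with a plain exact-match test, and the host names, network range, and function names are all made up for illustration.

```python
import ipaddress

# Hypothetical settings values.
ALLOWED_HOSTS = ["www.example.com"]
ALLOWED_CIDR_NETS = ["10.0.0.0/16"]


def strip_port(host_header):
    """Drop an optional :port suffix (hostnames and IPv4 only in this sketch)."""
    name, sep, port = host_header.rpartition(":")
    if sep and port.isdigit():
        return name
    return host_header


def host_is_allowed(host_header, allowed_hosts, cidr_nets):
    host = strip_port(host_header).lower()
    # First check: the normal host list (simplified to exact matches here).
    if host in allowed_hosts:
        return True
    # Secondary check: is the host an IP address inside an allowed network?
    try:
        address = ipaddress.ip_address(host)
    except ValueError:
        return False  # not an IP address at all
    return any(address in ipaddress.ip_network(net) for net in cidr_nets)


def status_for(host_header):
    """200 when the Host header passes either check, otherwise 400."""
    ok = host_is_allowed(host_header, ALLOWED_HOSTS, ALLOWED_CIDR_NETS)
    return 200 if ok else 400


site_request = status_for("www.example.com")
elb_health_check = status_for("10.0.42.7:8000")
outside_ip = status_for("192.168.1.1")
unknown_host = status_for("evil.example.net")
```

Because the check is against a network range rather than a list of node IPs, a node added during a scaling event is accepted without any pod needing new configuration, provided its address falls inside the range.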

Conclusion

That was a long way to go to get to some simple health checking, but we believe it was the right move for the reliability and security of our Django apps hosted in our k8s infrastructure on AWS. Please check out the repo for django-allow-cidr on GitHub if you’re interested in the code. Our hope is that releasing this as a general use package will help others that find themselves in our situation, as well as helping ourselves to do less copypasta coding around our various web projects.

MozMEAO SRE Status Report - February 16, 2018

Here’s what happened on the MozMEAO SRE team from January 23 to February 16.

Current work

SRE general

Load Balancers

Cloudflare to Datadog service

  • The Cloudflare to Datadog service has been converted to use a non-helm based install, and is running in our new Oregon-B cluster.

Oregon-A cluster

  • We have a new Kubernetes cluster running in the us-west-2 AWS region that will run support.mozilla.org (SUMO) services as well as many of our other services.

Bedrock

  • Bedrock is moving to a “sqlitened” version in our Oregon-B Kubernetes cluster that removes the dependency on an external database.

MDN

  • The cronjob that performs backups on attachments and other static media broke due to a misconfigured LANG environment variable. The base image for the cronjob was updated and deployed. We’ve also added some cron troubleshooting documentation as part of the same pull request.

  • Safwan Rahman submitted an excellent PR to optimize Kuma document views 🎉🎉🎉.

support.mozilla.org (SUMO)

  • SUMO now uses AWS Simple Email Service (SES) to send email.
  • We’re working on establishing a secure link between SCL3 and AWS for MySQL replication, which will help us significantly reduce the amount of time needed in our migration window.
  • SUMO is now using a CDN to host static media.
  • We’re working on Python-based Kubernetes automation for SUMO based on the Invoke library. Automation includes web, cron and celery deployments, as well as rollout and rollback functionality.
  • Using the Python automation above, SUMO now runs in “vanilla Kubernetes” without Deis Workflow.