Bedrock - from code to production

Introduction

Bedrock is the code name for the project that runs www.mozilla.org. It is shiny, awesome, and open source. Mozilla.org is the site where everyone comes to download Firefox, and is also Mozilla’s main public-facing web property. It represents who we are and what we do.

Bedrock is a large Django app that has many moving parts, with web pages translated into many different languages (99 locales at time of writing!). This post aims to follow a piece of code through the development lifecycle; from bug, to pull request, to code review, localization, and finally pushed live to production. Hopefully it may prove insightful and give some things to think about when requesting changes or creating new content.

Note: A presentation was given on this topic. The slides and video from that event are online.

The change request

All development work should start with filing a bug. Bugzilla is our tool of choice for tracking changes, and links to bugs can be found throughout our commit history for reference. This gives us a paper trail showing when and why changes were made. For a site that has existed for many years, this is very useful.

A well written bug should provide a clear summary, along with as much additional information as may be relevant to resolving it. If the bug reports an error or mistake on a web page, clear steps to reproduce are a great start to getting it fixed quickly. MDN has an excellent set of guidelines for bug reporting with a lot more detail.

An example bug summary for creating a new web page might look like this:

Please create a new Firefox download page at the following URL:

https://www.mozilla.org/firefox/example-download-page/

Page assets are linked below:
[link to design file]
[link to copy file]

The download page needs to be live in production by: 1 April, 2017.
Locales required for translation on launch: en-US, de, fr, es-ES.

If a bug is requesting the creation of new content or updates to existing content, related design and copy assets should be linked directly in the bug. For time sensitive requests, a clear due date should also be provided.

If a change requires localization, it also needs to be clearly stated. Our web pages are translated by teams of volunteers, so adequate time and warning must be given if there are hard deadlines involved. We will not show untranslated text to non-English locales on our web pages.

Once a bug has been triaged and contains enough information to act on, it can be assigned to a developer. When a bug has been assigned, any changes or feedback should be added directly to the bug (email is a great place for information to get lost).

Creating a new page

When we’re ready to start working on our bug, we first create a new feature branch on the bedrock GitHub repo. This allows us to work on the bug in parallel with all the other changes landing in the master branch of the repository. You can learn more about git workflows using GitHub’s introduction guide.

Creating a new page on bedrock is pretty simple. All pages are built using the Jinja template engine, inheriting from a common base template. This base template scaffolds out everything we need to start creating a new page. Then it’s just a case of adding in content and any additional styling or behavior the page might need.
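To illustrate the template inheritance described above, here is a minimal, self-contained sketch using the jinja2 library. The template names, block names, and content are hypothetical stand-ins, not bedrock’s actual files:

```python
# Minimal sketch of Jinja template inheritance, using an in-memory loader.
# Template and block names here are illustrative, not bedrock's real templates.
from jinja2 import Environment, DictLoader

templates = {
    'base.html': (
        '<html><head><title>{% block page_title %}mozilla.org{% endblock %}'
        '</title></head><body>{% block content %}{% endblock %}</body></html>'
    ),
    'example-download-page.html': (
        "{% extends 'base.html' %}"
        '{% block page_title %}Download Firefox{% endblock %}'
        '{% block content %}<h1>Get Firefox today</h1>{% endblock %}'
    ),
}

env = Environment(loader=DictLoader(templates))
# The child template only supplies its own content; the base template
# scaffolds everything else.
html = env.get_template('example-download-page.html').render()
print(html)
```

The child template only has to fill in the blocks it cares about; everything else comes from the base for free, which is exactly what makes new pages quick to create.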

When we’re finished creating our new template and adding in content, our example download page might look something like this:

Example download page screenshot

Because our example page inherits from a common base template it automatically receives the following characteristics:

  • Responsive and mobile-first by default. All pages should be designed with responsiveness in mind. Designs that start with small (mobile) viewports first and then expand up toward desktop sizes work better than the other way around.
  • Page component styling using Pebbles, our own lightweight CSS framework. Pebbles consists of a shared library of styles for common elements that appear throughout the site (e.g. page headers, navigation, footers, email signup forms, download buttons). Working with common design components allows us to create and update pages quickly.
  • Pre-built for localization (l10n). All pages are coded with strings wrapped ready for l10n. This means that copy and visuals should be suitable for different languages and designed to work well with variable length strings.
  • Built-in platform detection for Firefox downloads. We go to great lengths to try and make sure users get the correct binary file for their operating system and language when clicking on a download button. All pages get this logic for free should they need it.
  • Privacy friendly. We work hard to make sure all our pages respect user privacy as much as possible. We support Do Not Track (DNT), so users with this enabled are not tracked in Google Analytics or entered into Firefox Desktop Attribution. They will also not be entered into A/B experiments using libraries such as Traffic Cop.
  • Security focused. We enforce an active Content Security Policy (CSP) that limits the scope of third-party content allowed to run on the site. This aims to protect our users from security attacks and also helps ensure the integrity of Firefox downloads. It is a critical feature for our visitors.
  • Built using progressive enhancement and with accessibility in mind. User experiences need to work well with keyboard, mouse and touch, and work well with screen reading software. Pages should still function or degrade gracefully even if JavaScript fails, or is disabled by the user.
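As a sketch of the Do Not Track behavior mentioned above: bedrock’s real check happens in client-side JavaScript, but the logic amounts to something like the following. The function names and the dict-based headers are illustrative stand-ins, not bedrock’s actual code:

```python
# Illustrative server-side sketch of honoring Do Not Track (DNT).
# Bedrock's real check is done in client-side JavaScript; this is a stand-in.
def dnt_enabled(headers):
    """Return True if the browser sent the DNT: 1 header."""
    return headers.get('DNT') == '1'

def should_run_experiment(headers):
    # Users with DNT enabled are excluded from analytics and A/B experiments.
    return not dnt_enabled(headers)

print(should_run_experiment({'DNT': '1'}))  # False: user opted out of tracking
print(should_run_experiment({}))            # True: no DNT header sent
```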

A note about browser support

Bedrock currently supports first-class CSS and JavaScript for all major evergreen browsers, as well as for Internet Explorer 9 and upward. Internet Explorer 8 and below get a simplified, universal stylesheet. This ensures that content is readable and accessible, but nothing more.

Here’s how our example download page might look in old browsers:

Example download page screenshot in Internet Explorer 8

Not much to it, right? The key here is that all the page information is still perfectly readable and accessible, meaning the user can still accomplish their goal. By providing a more basic set of styling to older browsers, we are free to use more modern web platform features to accomplish more sophisticated designs and experiences.

Special considerations for download pages

For critical download pages such as our example page, we still need to ensure that users can successfully download Firefox, even in older browsers. When the user clicks the download button, they still need to receive the correct file for their operating system and language.

Because of this added support requirement, download pages may require additional testing and QA. For really high traffic pages such as /firefox/new/, we may still make extra effort to provide a higher level of CSS support to older browsers. The pages won’t look exactly the same, but they should still degrade gracefully.

Localization (l10n)

Many mozilla.org pages are translated into multiple languages. While much of our marketing focus is on English-language locales, non-English traffic actually makes up around 60% of our overall site traffic. Unless there is a specific request for a page to be English only, we create all pages assuming they will be translated should our volunteer community wish to do the work.

Here’s how our example download page might look in Italian:

Example download page screenshot in Italian

Because our translations are done by volunteers, we try our best to minimize the number of string changes we ask them for. If a page is translated into over 40 locales and we change one string, that still requires 40 new translations to be made. If strings are changing every week, this can begin to put a lot of strain on the goodwill of our community. To minimize this churn we try to:

  • Only change strings if absolutely necessary.
  • Batch up string changes and send them out in one go.
  • Reuse existing strings that may already be translated (if similar to what’s being asked for).
  • For new or redesigned pages that need to be translated for a set launch date, we typically ask for a lead time of 3-4 weeks.

Of course, each time we change a string on an existing page, we can’t always wait 4 weeks before merging the change. For situations like this we often use an l10n conditional tag in the Jinja template. For example, if we wanted to change the main heading on our download page we could do:

<h1>
  {% if l10n_has_tag('page_heading_update') %}
    {{ _('Download Firefox today!') }}
  {% else %}
    {{ _('Get Firefox today') }}
  {% endif %}
</h1>

This allows locales to opt in to the new translated string one by one; otherwise they fall back to the original string.

We can also do this type of thing for entirely redesigned pages in our view logic, so locales only see the new design once they have it translated.

def example_download_page(request):
    locale = l10n_utils.get_locale(request)

    if lang_file_is_active('firefox/example-page/redesign', locale):
        template = 'firefox/example-page/redesign.html'
    else:
        template = 'firefox/example-page/index.html'

    return l10n_utils.render(request, template)

Doing this still comes with technical debt, of course: the old content, conditional logic, and .lang files all need cleaning up later, once every locale is using the new translations.

GitHub pull request

Once our page is coded and we’re ready for the next step, we can open a pull request on GitHub. Once this is done the following things typically happen:

  • Our continuous integration service, CircleCI, runs a series of automated checks on the pull request. This runs unit tests as well as checks the code for both syntax and style errors. Running these automated checks saves us significant time during code review by picking out routine errors.
  • The pull request must then be code reviewed by another human. Every code change gets reviewed by at least one other developer before merging.
  • If the change requires localization, strings are extracted from the branch and checked by our l10n team. If they look good, they are sent out to our volunteers for translation. Once this has begun, copy changes should be avoided, since a breaking change invalidates translation work already in progress.
  • Once the pull request is approved by a reviewer, it can then be merged into the master branch and automatically deployed to our dev environment.

Other useful things to know about

Demo servers

Sometimes we’d like to have a URL we can give people to demonstrate changes that aren’t yet live on the site and that aren’t yet ready to be merged into our master branch. We’ve enabled this via our Jenkins CI server and GitHub. A developer who would like to demo their changes has only to follow these steps:

  1. Push their git branch to the primary repository using a special branch name.
    • The branch name must start with demo__, followed by any combination of letters, numbers, and dashes (e.g. demo__the-dude)
  2. Profit

That’s really it. Jenkins will build the demo and if it is successful a notification in our IRC channel will inform the developer of their new demo’s URL. Feel free to check out how that works if you’re curious.
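The branch naming convention above can be sketched as a simple pattern check. The exact expression Jenkins uses may differ; this just encodes “demo__ followed by letters, numbers, and dashes”:

```python
# Hedged sketch of validating the demo branch naming convention.
# The real pattern used by the Jenkins job may differ from this one.
import re

DEMO_BRANCH_RE = re.compile(r'^demo__[A-Za-z0-9-]+$')

def is_demo_branch(name):
    """Return True if the branch name follows the demo__ convention."""
    return bool(DEMO_BRANCH_RE.match(name))

print(is_demo_branch('demo__the-dude'))    # True
print(is_demo_branch('feature/new-page'))  # False: not a demo branch
```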

A/B testing

For A/B testing bedrock has two options available, Traffic Cop and Optimizely. Jon wrote a great blog post on the pros and cons of using each solution so we won’t repeat them here. You should go read it.

Analytics

Analytics on bedrock are provided using Google Tag Manager (GTM). Most of bedrock’s shared components and pages are pre-built for analytics using common event handlers and data attributes. We can also create and fire custom events as required.

GTM on bedrock is implemented in such a way that it respects DNT, and user experiences should not break if tracking protection or content blocking software is installed.

Feature toggling

Production deployments are easier than ever for bedrock, but they are not, and likely never will be, risk-free. There is also much that could go wrong (at AWS or any of our networks) that could prevent a production deployment from being successful or timely. Therefore we have developed the ability to use environment variables as feature switches. These allow us to simply update the running environment of the bedrock servers via the Deis command-line utility, and the site will react to those changes without the need for us to deploy new code.

These switches are very useful in situations that require precise timing, or modification of the site during off hours when support for a failed deployment would be light to nonexistent. We’ve used them for announcements that need to be timed with publications elsewhere, or with stage events at gatherings like Mobile World Congress.

Deployment

We deploy to production fairly often, and we deploy to our dev instance on every push to our master branch. We are capable of deploying to production many times per day, and like to deploy at least once per day. When any of our development team is happy with what is on dev and ready to deploy to production, they have only to follow these steps:

  1. Check the deployment pipeline to make sure the latest master branch build was a success.
  2. Add a properly formatted git tag (e.g. 2017-02-28.1) to the latest master commit, and push that to the prod branch.
    • This can all be accomplished automatically by running bin/tag-release.sh --push from the project.
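As a rough sketch of what a tag in that format looks like, a date-based tag with an incrementing suffix might be generated like this. The suffix logic here is an assumption for illustration, not bin/tag-release.sh’s actual implementation:

```python
# Sketch of generating a release tag in the 2017-02-28.1 style shown above.
# bin/tag-release.sh handles this for real; the suffix logic is an assumption.
from datetime import date

def next_release_tag(today, existing_tags):
    """Return the next tag for today's date, bumping the .N suffix."""
    prefix = today.isoformat()  # e.g. '2017-02-28'
    todays = [t for t in existing_tags if t.startswith(prefix + '.')]
    next_n = len(todays) + 1
    return '{}.{}'.format(prefix, next_n)

print(next_release_tag(date(2017, 2, 28), []))                # 2017-02-28.1
print(next_release_tag(date(2017, 2, 28), ['2017-02-28.1']))  # 2017-02-28.2
```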

Once this is done, Jenkins will notice that a change has been made to the prod branch that has been appropriately tagged, and will start its deployment process:

  1. Build the necessary docker images
  2. Test said docker images
    • Python unit tests
    • Selenium-based smoke tests run against the app running locally
  3. Push docker images to public docker hub and our private registries
  4. Tell Deis to deploy the new docker images to the staging app in our Oregon AWS cluster
  5. Run our integration test suite against the staging instance in Oregon
    • Our full Selenium test suite is run against multiple browsers (Firefox, Chrome, IE, etc.) in Sauce Labs.
  6. Repeat steps 4 and 5 for prod in Oregon, then for stage then prod in Ireland.

If any testing phase fails we are notified in IRC and the deployment is halted. The whole process usually takes around an hour.

A deployment to our dev instance is very similar to our production deployment (above). The primary differences are that it happens on every push to master, and that fewer integration tests are run against the deployed app. Our dev instance is also slightly different in that the DEV setting is set to True. This primarily means that a development version of our localization files is used (so that localizers have a stable site from which to view their work in progress), and that all feature switches are on by default.

Environments

Our server infrastructure is based on AWS EC2 running CoreOS instances in a Deis v1 cluster. Those words may not mean much to you if you’re not into server ops, but it basically means that we use Amazon’s cloud computing resources to manage our servers, and that Deis is a FOSS Heroku clone that gives us a developer-friendly deployment flow. We hope to soon move to Deis v2 on top of Kubernetes while staying at AWS, but for now v1 is what we’re using.

We have three primary instances of the site: Dev, Stage, and Prod. They are all made up of deployments in our two primary Deis clusters in Oregon (US) and Ireland (AWS us-west-2 and eu-west-1 regions respectively).

The primary domain for both stage and prod points to our CDN provider, which in turn points to another domain hosted in AWS Route53. This domain uses latency-based routing to send clients to the fastest servers from their location. So each node at the CDN should get routed to the closest server cluster to it. This also means that we get automatic fail-over in case one cluster goes down or starts throwing errors.

Conclusion

Whew! That’s a lot of information to take in. With any luck, our example page would now be magically deployed in production after going through the pipeline. Our process for deployment may sound a bit complicated, and that’s because it is. The reason we’ve built this kind of automation is so we can remove as many manual steps as possible, saving us time and also helping us to deploy with greater speed and confidence.

Kuma Report, February 2017

Here’s what happened in February in Kuma, the engine of MDN:

  • Added Demo deployments in AWS
  • Promoted the Sauce Labs partnership
  • Packaged MDN data
  • Shipped tweaks and fixes

Here’s the plan for March:

  • Ship read-only maintenance mode
  • Test examples at the top of reference pages
  • Ship the sample database

Done in February

Demo Deployments in AWS

Thanks to metadave and escattone, MDN staff can now deploy demo servers to AWS. Bedrock has had this feature for a while, and it is extremely useful when demonstrating a change for manual or automated testing. It is also one step closer to MDN being hosted in AWS.

Sauce Labs Partnership

Sauce Labs provides a platform to test your website across many OS and browser combinations, so that you can automate testing and find issues before your users do. Mozilla is partnering with Sauce Labs to provide a free trial of the service. We’re promoting this offer on the home page and our introduction to automated testing. This required cross-team collaboration from Vik Iya, Kadir Topal, Rachel Wong, and Arcadio Lainez, and was implemented on MDN by jpetto.

Packaged MDN data

The writers have continued to work on extracting MDN data to GitHub. The mdn/data repo has grown an excited community who are impatient to publish the data on npm. Thanks to Elchi3 and iamstarkov, we have a package.json file, and you can now run npm install mdn/data. Further work is planned to publish on npm, and to make this data (and the Browser Compatibility data) useful on MDN and in other projects.

Shipped Tweaks and Fixes

Other highlights from the 32 merged Kuma PRs in February:

  • PR 4141: Fix capitalization for Articles in need of review page (Tckool’s first PR!).
  • PR 4136: Remove TransactionTestCase, making tests twice as fast for local development (safwanrahman).
  • PR 4057, PR 4067, PR 4080, and PR 4088: Improve the page editing experience with CKEditor updates and tweaks (a2sheppy).
  • PR 4099: Remove Intern Tests, completing the transition to py.test for browser-based tests (stephaniehobson).
  • PR 4125: Expand docs for the MDN CI pipeline (escattone). We now have a short, documented process for running integration tests during deployments.
  • PR 4114: Move font-loading to client-side, improving performance and simplifying the backend (jpetto).
  • PR 4103: Fix bugs with case-insensitive tags (jwhitlock).

KumaScript continues to be busy, with 19 merged PRs contributed by Elchi3, SebastianZ, SphinxKnight, a2sheppy, chrisdavidmills, and jpmedley. MDN staff and core volunteers are becoming more experienced with GitHub, and at fixing git issues over IRC.

Planned for March

We’re headed to sunny Toronto, Ontario for a Spring work week, where we’ll plan Q2 2017 and beyond. We also plan to ship some features in March:

Read-Only Maintenance Mode

jpetto and escattone have been working on Read-Only Maintenance Mode, a Kuma configuration that works against a recent database backup, displaying MDN data but not allowing login or page editing. We’ll work with jgmize and metadave to deploy this mode to AWS in March, eventually testing with live MDN traffic during off-peak hours.

Examples at the top of reference pages

In the next few months, we’re going to experiment with small, interactive examples at the top of high-traffic reference pages, and collect qualitative and quantitative data on visitor reactions. This includes an A/B test of the changes, using the Traffic Cop library that we introduced a few weeks ago.

Ship the Sample Database

The Sample Database has been promised every month since October 2016, and has slipped every month. We don’t want to break the tradition: the sample database will ship in March. See PR 4076 for the remaining tasks.

Introducing the MozMEAO infrastructure repo

This is a quick introduction post from the MozMEAO Site Reliability Engineers. As SREs at Mozilla, Josh and I are responsible for infrastructure, automation, and operations for several sites, including mozilla.org and MDN.

We try to keep much of our work as public as possible, so we’ve created https://github.com/mozmar/infra to share some of our automation and tooling. We have additional internal repos to manage some of our private infrastructure as well.

Feel free to try out our scripts and tools, and let us know via GitHub issue or pull request if we’ve missed anything.

Kuma Report, January 2017

Here’s what happened in January in Kuma, the engine of MDN:

  • Upgraded to node.js v6
  • Reached next milestone on functional tests
  • Switched from Stylus to Sass for CSS
  • Published AWS Migration Plan
  • Shipped Tweaks and Fixes

Here’s the plan for February:

  • Ship the Sample Database
  • Demo deployments in AWS
  • Read-Only Maintenance Mode

Done in January

Upgraded to node.js v6

KumaScript, MDN’s rendering engine, runs on node.js, and we also use node.js-based tools in our static asset pipeline. We upgraded from v0.10 to v6, which will be supported under the Long-term Support policy until April 2019.

Ryan Johnson (rjohnson) updated the KumaScript engine, including switching from checked-in modules to package.json. I updated the Kuma side. We worked with Ryan Watson (rw0ts0n) and Eric Ziegenhorn (ericz) to update the 13 production servers and the deployment process. It was a time-consuming update, but it went smoothly for users, with only a handful of rendering issues discovered after deployment.

We’re excited about Docker-based deployment, which will make similar updates easier in the future, and a KumaScript macros test suite, to detect rendering issues before they get to production.

Reached next milestone on functional tests

Stephanie Hobson (shobson) completed the conversion of the functional tests from Intern to py.test with the translation tests, and has submitted the pull request to remove the Intern tests. We now have a library of browser-based functional tests to verify that an integrated environment is serving MDN correctly. Giorgos Logiotatidis (giorgos) has re-written the Jenkins integration pipeline, providing a framework for automated acceptance testing. There’s more work to do, but we have a good foundation for automatically detecting more issues before they appear in production.

Switched from Stylus to Sass for CSS

Kuma first started using Stylus way back in 2013, when we started MDN redesign from the “black” design to the current “blue” design. At the time, use of CSS preprocessors was growing, and there wasn’t a clear winner. In a 2012 poll, 54% of developers had tried a preprocessor, and LESS was the most popular at 51%, probably due to its use in Twitter Bootstrap.

In 2017, it looks like Sass is the CSS preprocessor of choice. There are more tools and tutorials available. Twitter Bootstrap is now just Bootstrap, and has an official Sass port. We’re planning on a lot of front-end changes in 2017, so it is a good time to switch to a new toolset.

Stephanie Hobson (shobson) changed the Stylus files to Sass, and Jon Petto (jpetto) and Ryan Johnson (rjohnson) worked to validate the changes, and integrate Sass into the static asset pipeline. This included improving the build process for Docker, which can now be used more efficiently for front-end development.

Published AWS Migration Plan

The Kuma team has been working toward rehosting MDN for years, updating systems and modifying the software architecture to fit cloud computing standards. Soon, we’ll start moving services to AWS, with a goal of rehosting production in AWS later this year.

There are a lot of moving pieces, which we’ve cataloged in the AWS Migration Plan. Take a look to see what is coming, or if you are having trouble sleeping.

Shipped Tweaks and Fixes

Other highlights from January:

  • PR 4070: Improve error message when tag list is too long (gautamramk’s first PR!).
  • PR 4089: Add Bulgarian to the candidate languages, and enable on the staging server.
  • PR 4095: Add rel="nofollow" to non-indexable links (the first of several PRs from Jon Petto (jpetto)).

Planned for February

Ship the Sample Database

The Sample Database has been promised every month since October 2016, and has slipped every month. We don’t want to break the tradition: the sample database will ship in February. See PR 4076 for the remaining tasks and to download the beta sample database.

Demo deployments in AWS

We are working with Josh Mize (jgmize) and Dave Parfitt (metadave) to automate deployment of temporary instances of Kuma to AWS. This will be useful for demonstrating new code, as well as for load and integration testing. This is a first step toward deploying staging and production instances to AWS.

Read-Only Maintenance Mode

We are working on Read-Only Maintenance Mode, a Kuma configuration that works against a recent database backup, displaying MDN data but not allowing login or page editing. This will be useful for keeping content available during database maintenance and for load-testing. We also want to determine the effort needed to split Kuma between read-only and read-write instances, as a possible AWS deployment strategy.

Traffic Cop - Simple & lightweight A/B testing

Update - Traffic Cop has been released as a stand-alone library.

We recently added a home-grown A/B testing framework to bedrock, the codebase powering mozilla.org. We named it Traffic Cop, as most of our content experiments simply redirect users to a different URL.

Why did we build it?

Prior to Traffic Cop, we were using Optimizely to handle both visitor redirection and content changes. While Optimizely has been functionally sound, there are a number of downsides that we’ve been grumbling about for some time:

  1. Security — Optimizely is a potential XSS vector. This has nothing to do with Optimizely per se - any JavaScript loaded from a third-party presents this risk. Anything that increases the chances of a user downloading a compromised build of Firefox is something we want to avoid.
  2. Performance — To avoid content flicker, any JavaScript performing redirects should be loaded in the <head> of the document. All users are forced to download this render-blocking JavaScript payload (even those not chosen for the experiment), so it makes sense to keep it as small as possible. In addition to loading code for all running experiments (not just those targeting the current page), Optimizely also bundles a separate build of jQuery [1]. As you might guess, this results in a rather large JavaScript bundle. (Around 200KB at last check.)
  3. Code Quality — Optimizely code must be written and reviewed within a textarea on a web page, making for more error-prone development and time-consuming code review.
  4. Cost — Optimizely is a paid service. We’re not crying poor here, but perhaps that money can be better spent elsewhere.

Have we cancelled our Optimizely account? No. Not all of our experiments are of the simple “redirect a visitor” variety. Optimizely is still providing some value for us, but we try to make sure all experiments that can use Traffic Cop do.

How does it work?

A visitor hits a URL running an experiment, e.g. https://www.mozilla.org/en-US/internet-health/. Traffic Cop picks a random number, and, if that random number falls within the range specified by a variation, redirects the visitor to that variation, e.g. https://www.mozilla.org/en-US/internet-health/?v=2.

Traffic Cop assumes all variations are loaded through a querystring parameter appended to the original URL. This keeps things simple, as no new URL patterns need to be defined (and later removed) for each experiment. We simply check for the querystring parameter (either in the view or in the template [2]) and load different content accordingly. An added benefit of this approach is that we are free to make content changes in separate HTML, CSS, and JavaScript files, whereas Optimizely operates only via JavaScript DOM manipulation.
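A minimal sketch of checking that querystring parameter in a view, in the style of the earlier Django view example. The template names are hypothetical, and the plain dict stands in for Django’s request.GET:

```python
# Hedged sketch of serving variation content based on a querystring parameter.
# Template names are illustrative; `get_params` stands in for request.GET.
def example_experiment_page(get_params):
    variation = get_params.get('v')
    if variation == '2':
        template = 'firefox/example-page/variation-2.html'
    else:
        # No (or unknown) variation: serve the control page.
        template = 'firefox/example-page/index.html'
    return template

print(example_experiment_page({'v': '2'}))  # variation template
print(example_experiment_page({}))          # control template
```

Because unknown values fall through to the control page, an experiment can be removed later just by deleting the conditional, with no URL patterns to clean up.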

Implementing Traffic Cop requires two [3] other JavaScript files: one to configure the experiment, and MDN’s handy cookie framework. The configuration file is fairly straightforward. Simply instantiate a new Traffic Cop with your experiment configuration, and then initialize it.

// this is an example of a configuration file
// assume Traffic Cop & the MDN cookie helper scripts are already loaded
var wiggum = new Mozilla.TrafficCop({
  id: 'experiment-home-page-hero-image-fall-2016',
  variations: {
    'v=1': 20,
    'v=2': 30
  }
});

wiggum.init();

In the above example, the id parameter is the unique identifier placed in a cookie to determine if a user has already been chosen for a specific variation. The variations object has keys that correspond with the intended querystring value, and values that map to a percent chance of that variation being chosen. For example, a visitor would have a 30% chance of being redirected to {currentURL}?v=2.
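The selection logic described above fits in a few lines. Traffic Cop itself does this in client-side JavaScript; this Python version, mirroring the wiggum configuration, is just an illustration of the approach:

```python
# Sketch of percentage-based variation selection, mirroring the wiggum
# configuration above (v=1: 20%, v=2: 30%, no redirect: 50%).
# Traffic Cop does this in client-side JavaScript; this is an illustration.
import random

def choose_variation(variations, roll=None):
    """Pick a variation key, or None, based on a 1-100 roll."""
    roll = random.randint(1, 100) if roll is None else roll
    upper = 0
    for key, percent in variations.items():
        upper += percent
        if roll <= upper:  # roll falls within this variation's range
            return key
    return None  # visitor stays on the original page

variations = {'v=1': 20, 'v=2': 30}
print(choose_variation(variations, roll=10))  # 'v=1' (rolls 1-20)
print(choose_variation(variations, roll=35))  # 'v=2' (rolls 21-50)
print(choose_variation(variations, roll=90))  # None  (rolls 51-100)
```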

Read the docs to see examples and get more technical details.

In summary, Traffic Cop allows us to write code in a text editor, review code in a pull request, and avoid heavy and potentially insecure third-party JavaScript code injection, all for free.



[1]: This is technically avoidable by loading our version of jQuery in the head of all pages, but that would result in worse performance site-wide.

[2]: pmac wrote a really handy mixin and view to make working with Traffic Cop even easier.

[3]: Okay, yes, you looked at the source and saw Traffic Cop looks for a globally scoped function by the name of _dntEnabled. Guilty. However, Traffic Cop carries on just fine if _dntEnabled doesn’t exist, so simmer down. As you’ve probably already guessed, _dntEnabled is a function that checks the doNotTrack status of the visitor’s browser. Be a conscientious developer and respect this setting for your visitors as well.