MozMEAO SRE Status Report - 5/16/2017

Here’s what happened on the MozMEAO SRE team from May 9th - May 16th.

Current work

Bedrock (mozilla.org)

Work continues on moving Bedrock to our Kubernetes infrastructure.

Postgres/RDS provisioning

A Postgres RDS instance has already been provisioned in us-east-1 for our Virginia cluster, and another was created in ap-northeast-1 to support the Tokyo cluster. Additionally, development, staging, and production databases were created in each region. This process was documented here.

Elastic Load Balancer (ELB) provisioning

We’ve automated the creation of ELBs for Bedrock in Virginia and Tokyo. There are still a few wrinkles to sort out, but the infra is mostly in place to begin The Big Move to Kubernetes.

MDN

Work continues on analyzing the Apache httpd configuration from the current SCL3 datacenter.

Downtime incident 2017-05-13

On May 13th, 2017, from 22:49 to 22:55, New Relic reported that MDN was unavailable. The site was slow to respond to page views and was running long database queries. Log analysis shows a security scan of our database-intensive endpoints.

On May 14th, 2017, there were high I/O alerts on 3 of the 6 production web servers. The alerts did not coincide with high traffic or a decrease in responsiveness.

Basket

The FxA team would like to send events (FXA_IDs) to Basket and Salesforce, and needed SQS queues in order to move forward. We automated the provisioning of dev/stage/prod SQS queues, and passed off credentials to the appropriate engineers.

The FxA team requested cross AWS account access to the new SQS queues. Access has been automated and granted via this PR.
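Cross-account access to an SQS queue is typically granted through a resource policy attached to the queue. The following is a minimal sketch of what such a policy document looks like, not the contents of the PR mentioned above. The function name, the queue ARN, the account ID, and the choice of `sqs:SendMessage` as the only permitted action are all assumptions made for illustration.

```python
import json


def cross_account_sqs_policy(queue_arn, trusted_account_id, actions=("sqs:SendMessage",)):
    """Build a queue resource policy that lets another AWS account use the queue.

    Hypothetical helper: the real automation may grant different actions
    or scope the principal more narrowly than the account root.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "CrossAccountAccess",
                "Effect": "Allow",
                "Principal": {"AWS": f"arn:aws:iam::{trusted_account_id}:root"},
                "Action": list(actions),
                "Resource": queue_arn,
            }
        ],
    }


if __name__ == "__main__":
    # Placeholder ARN and account ID, for illustration only.
    policy = cross_account_sqs_policy(
        "arn:aws:sqs:us-east-1:111111111111:example-queue", "222222222222"
    )
    print(json.dumps(policy, indent=2))
```

The resulting JSON would be set as the queue's `Policy` attribute by whatever provisioning tool is in use.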

Snippets

Snippets Stats Collection Issues 2017-04-10

A planned configuration change to add a Route 53 Traffic Policy for the snippets stats collection service caused a day’s worth of data to go uncollected due to an SSL certificate error.

Careers

Autoscaling

In order to take advantage of Kubernetes cluster and pod autoscaling (which we’ve documented here), app memory and CPU limits were set for careers.mozilla.org in our Virginia and Tokyo clusters. This allows the careers site to scale up and down based on load.
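As a rough illustration of how pod autoscaling reacts to load, the Kubernetes Horizontal Pod Autoscaler derives a desired replica count from the ratio of observed to target utilization. The sketch below reproduces that core calculation only; it leaves out the tolerance band and stabilization behavior of the real controller, and the function name, parameter names, and replica bounds are assumptions for illustration.

```python
import math


def desired_replicas(current_replicas, current_utilization, target_utilization,
                     min_replicas=1, max_replicas=10):
    """Core HPA calculation: ceil(current * observed / target), then clamp.

    Utilization values are percentages. The min/max bounds are placeholder
    defaults, not the values used for careers.mozilla.org.
    """
    raw = math.ceil(current_replicas * current_utilization / target_utilization)
    return max(min_replicas, min(max_replicas, raw))


if __name__ == "__main__":
    # Two pods at 90% CPU against a 50% target: scale up.
    print(desired_replicas(2, 90, 50))
    # Four pods at 10% CPU against a 50% target: scale down.
    print(desired_replicas(4, 10, 50))
```

Because utilization is measured relative to each pod's declared resources, the autoscaler has nothing to compare against until those values are set on the app.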

Acceptance tests

Giorgos Logiotatidis added acceptance tests, consisting of a simple bash script and additional Jenkinsfile stages, to check that careers.mozilla.org pages return valid responses after deployment.
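The actual check is a bash script; the sketch below shows the same idea in Python for readers who want to adapt it. It is not the script that was added. The function names and the rule that only a 200 status counts as valid are assumptions.

```python
import urllib.error
import urllib.request


def fetch_status(url, timeout=10):
    """Return the HTTP status code for url, or None if the request failed."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 404 or 500.
        return err.code
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout, and similar.
        return None


def failing_pages(base_url, paths, fetch=fetch_status):
    """Return (path, status) pairs for every page that did not return 200."""
    failures = []
    for path in paths:
        status = fetch(base_url.rstrip("/") + path)
        if status != 200:
            failures.append((path, status))
    return failures
```

A pipeline stage could call `failing_pages` with the deployed site's base URL and a short list of paths, then fail the build if the returned list is not empty. The `fetch` parameter exists so the check can be exercised without network access.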

Downtime incident 2017-04-11

A typo was merged and pushed to production, causing a couple of minutes of downtime before we rolled back to the previous version.

Decommission openwebdevice.org status

openwebdevice.org will remain operational in http-only mode until the board approves decommissioning. A timeline is unavailable.

Future work

Nucleus

We’re planning to move nucleus to Kubernetes, and then proceed to decommissioning current nucleus infra.

Basket

We’re planning to move basket to Kubernetes shortly after the nucleus migration, and then proceed to decommissioning existing infra.

MozMEAO SRE Status Report - 5/9/2017

Here’s what happened on the MozMEAO SRE team from May 3rd - May 9th.

Current work

Bedrock (mozilla.org)

Bedrock multi-region RDS provisioning

Work continues to move Bedrock from Deis 1/Fleet to Kubernetes. The team has implemented Terraform automation to provision RDS instances in multiple regions.

Demo deployments

Jenkins deployments have been restructured, and demos now build in the main pipeline. This was a meaty PR from pmac, and a motivation to upgrade Deis Workflow to the latest version (more info below).

Next actions:

  • create persistent development, staging, and production applications using RDS (Postgres)
  • enable deployments to new apps in Jenkins
  • CloudFront distribution and integration testing

MDN

We’re working on migrating the custom Apache config for MDN directly into Kuma/Django for the eventual move to AWS. Most of the Apache rewrites/redirects have been implemented in Kuma, with only a few remaining.
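To give a sense of what moving a rewrite rule out of Apache and into application code involves, here is a minimal, framework-free sketch of a regex-driven redirect table. It is not Kuma's implementation, and the rule shown is a made-up example rather than one of MDN's real redirects.

```python
import re

# Hypothetical rule: each entry pairs a compiled pattern with a target
# template that can reference named groups from the pattern.
REDIRECTS = [
    (re.compile(r"^/old-docs/(?P<slug>.+)$"), r"/docs/\g<slug>"),
]


def resolve_redirect(path, rules=REDIRECTS):
    """Return the redirect target for path, or None if no rule matches."""
    for pattern, target in rules:
        match = pattern.match(path)
        if match:
            return match.expand(target)
    return None


if __name__ == "__main__":
    print(resolve_redirect("/old-docs/Web/CSS"))
    print(resolve_redirect("/unrelated"))
```

In a Django project, a lookup like this would usually sit in middleware or be expressed as URL patterns pointing at a redirect view.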

Basket

The FxA team would like to send events (FXA_IDs) to Basket and Salesforce, and need SQS queues in order to move forward. We automated the provisioning of dev/stage/prod SQS queues, and passed off credentials to the appropriate engineers.

Kubernetes / Deis Workflow

Deis Workflow has been upgraded to the latest version (2.14.20) in Virginia and Tokyo. We hit a few snags during the first upgrade, as our Workflow install has some customization that wasn’t applied. Subsequent upgrades should be easier, as we’ve automated the process via a script (with minor tweaks in this PR).

Snippets

Snippets-stats is running in Tokyo and Virginia.

snippets-stats was already running on our Deis 1 clusters in Oregon and Ireland, and Giorgos has now enabled it on our Virginia and Tokyo Kubernetes clusters.

Issues with HTTP_X_FORWARDED_PROTO header not set for snippets-*.virginia.moz.works

We created a generic http to https redirector service that runs in Kubernetes. This allows Kubernetes to handle forwarding http to https for us without having custom implementations in each application. However, there remained an issue in our current ELB setup where HTTP_X_FORWARDED_PROTO was not set, so Django could not tell whether a connection was secure.

pmac has implemented an alternative to X-Forwarded-Proto using an HTTPS env var and a WSGIRequest subclass.
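The same idea can be sketched without Django as a small WSGI middleware. This is a different technique from the request subclass described above, shown only to illustrate the principle: when the header is missing, an environment variable tells the app to treat every request as secure. The class name and the accepted values of the `HTTPS` variable are assumptions.

```python
import os


class ForceHttpsScheme:
    """WSGI middleware that marks every request as HTTPS when configured to.

    Hypothetical sketch: it trusts that TLS is terminated upstream and that
    plain-http traffic has already been redirected before reaching the app.
    """

    def __init__(self, app, enabled=None):
        self.app = app
        if enabled is None:
            # Read the switch from the environment when not given explicitly.
            enabled = os.environ.get("HTTPS", "").lower() in ("1", "on", "true")
        self.enabled = enabled

    def __call__(self, environ, start_response):
        if self.enabled:
            # Frameworks consult wsgi.url_scheme to decide if a request is secure.
            environ["wsgi.url_scheme"] = "https"
        return self.app(environ, start_response)
```

Wrapping the application object (`application = ForceHttpsScheme(application)`) is enough to apply it. This is only safe when nothing can reach the app over plain http, which is the job of the redirector service.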

Thanks to Giorgos and pmac for their hard work on this!

Decommission webwewant.mozilla.org

webwewant.mozilla.org has been decommissioned. All requests to webwewant.mozilla.org are now being forwarded to https://www.mozilla.org.

Future work

Decommission openwebdevice.org

Waiting for some internal communications before moving forward.

Nucleus

We’re planning to move nucleus to Kubernetes, and then proceed to decommissioning current nucleus infra.

Basket

We’re planning to move basket to Kubernetes shortly after the nucleus migration, and then proceed to decommissioning existing infra.

New Kubernetes cluster

We’ll be creating a new Kubernetes cluster in Portland so we can take advantage of EFS to support MDN in that region. We currently run many of our services from Portland, Virginia, and Ireland. The new cluster will be created in an entirely new VPC, and existing resources will not be shared.

MozMEAO SRE Status Report - 5/2/2017

Here’s what happened on the MozMEAO SRE team from April 25th - May 2nd.

Current work

Bedrock (mozilla.org)

Bedrock CDN

The SRE team is currently evaluating different CDN options for Bedrock. The CDN that we choose needs to have support for the Accept-Language header, which CloudFront and Fastly both appear to provide. Next up is testing CloudFront with a bedrock demo deployment.
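The header matters to a cache whenever the response depends on it. The sketch below shows a simplified form of header-based locale negotiation to make that dependency concrete. It is not Bedrock's code, the function names and the default locale are assumptions, and the prefix fallback is a simplification of full language-range matching.

```python
def parse_accept_language(header):
    """Return language tags from an Accept-Language header, best first."""
    weighted = []
    for position, item in enumerate(header.split(",")):
        tag, _, params = item.strip().partition(";")
        tag = tag.strip()
        if not tag:
            continue
        quality = 1.0  # a tag without a q-value defaults to 1.0
        params = params.strip()
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0
        if quality > 0:
            # Sort by descending quality, then by original position.
            weighted.append((-quality, position, tag))
    return [tag for _, _, tag in sorted(weighted)]


def pick_locale(header, supported, default="en-US"):
    """Choose the best supported locale for the header, or the default."""
    by_lower = {locale.lower(): locale for locale in supported}
    for tag in parse_accept_language(header):
        tag = tag.lower()
        if tag in by_lower:
            return by_lower[tag]
        # Fall back from a regional tag ("de-AT") to a locale sharing its language.
        prefix = tag.split("-")[0]
        for lower, locale in by_lower.items():
            if lower.split("-")[0] == prefix:
                return locale
    return default
```

Since two visitors requesting the same URL can be sent to different locales, a CDN in front of such a site has to take the header into account rather than serve one cached answer to everyone.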

Bedrock moving to Kubernetes

Our Fleet and Deis 1 infrastructure will eventually be replaced with Kubernetes and Deis Workflow. pmac and jgmize have bedrock deployed in our Virginia Kubernetes cluster. Minor issues with https redirects were uncovered, but have been resolved. Next steps are getting integration tests working and trying CloudFront with this deployment.

Bedrock log analysis

The bedrock durable team is looking to gather some traffic metrics for /firefox, and using AWS Athena to query the data in the S3 bucket populated by Papertrail looks like a viable solution.

Old resources in SCL3

pmac is going to follow up on moving old SCL3 Bedrock resources to an S3 bucket for backup.

MDN

The SRE team has been working on the analysis of the existing SCL3 MDN deployment and its migration to AWS.

Below are some issues and PRs related to this work:

New Relic Synthetics CLI tools

Giorgos is working on an unofficial New Relic Synthetics CLI tool:

“NeReS is a cli tool to manage NewRelic Synthetics monitors with a Synthetics Lite account (Pro should work too). The tool emulates the actions of a user in the browser and doesn’t use the Synthetics API since that’s only available to the Pro accounts.”

The project lives on GitHub.

Future work

Decommission webwewant.mozilla.org

We’ll be decommissioning webwewant.mozilla.org. A webops bug has been filed to redirect webwewant requests to mozilla.org.

Decommission openwebdevice.org?

Looking into the possibility of shutting down this site. Waiting for some internal communications before moving forward.

Nucleus

We’re planning to move nucleus to Kubernetes, and then proceed to decommissioning current nucleus infra.

Basket

We’re planning to move basket to Kubernetes shortly after the nucleus migration, and then proceed to decommissioning existing infra.

Snippets

Status unchanged since last week. Giorgos is looking at snippets-stats to see if it’s behaving correctly. The snippets-stats Route53 routing policy currently points at the Deis 1 deployment due to low stats alerts in the new K8s environment.

New Kubernetes cluster

We’ll be creating a new Kubernetes cluster in Portland so we can take advantage of EFS to support MDN in that region. We currently run many of our services from Portland, Virginia, and Ireland. The new cluster will be created in an entirely new VPC, and existing resources will not be shared.

Kuma Report, April 2017

Here’s what happened in April in Kuma, the engine of MDN:

  • Explored faster paths to AWS
  • Maintained quality with new robots
  • Improved the macros dashboard
  • Goodbye BrowserCompat API, Hello Browser Compat Data
  • Shipped tweaks and fixes

Here’s the plan for May:

  • Experiment with on-site interactive examples
  • More legacy cleanup and fixes
  • Ship the sample database

Done in April

Explored Faster Paths to AWS

The AWS Migration Plan details how Kuma and its backing services will need to evolve to fit into a cloud architecture. However, there are good reasons to make the switch quickly with a non-ideal architecture. We can minimize the painful transition where we are supporting both the current datacenter and AWS. Some changes are difficult to do now, and could be easier in the single AWS environment.

jgmize and metadave, the MozMEAO Site Reliability Engineers (SREs), are leading this effort to find a faster path to AWS. You can follow their work in the MDN AWS Architecture Eval & Recommendation milestone in the mozmar/infra repository. MDN developers are still involved when a code change is needed, like escattone’s work to disable the contributor bar in maintenance mode.

Maintained Quality with New Robots

Linters are pedantic robots that find syntax errors, style mistakes, and language misuse. They can be jerks about the small stuff so that code reviewers can focus on the ideas in the new code. We added several new linters to our development process:

Improved the Macros Dashboard

We’ve shipped an improved macros dashboard which lets KumaScript authors see how often macros are used, access the macro source, and find documents that use the macro. There are over 90 macros not used on any page, so there are opportunities for deprecating and removing macros.

Goodbye BrowserCompat API, Hello Browser Compat Data

BrowserCompat was a 2014 project to build an API to serve browser compatibility data, for MDN and for other users. An API was a development-heavy solution, and we had to abandon it in 2016 when we lost resources. This month, stephaniehobson removed the MDN assets supporting this project, so we can start shutting down the service.

For the next iteration of this idea, we’re hand-coding JSON structures with the Browser Compatibility data, and working to make MDN the first consumer of this data. You can follow the project on the browser-compat-data repository.

Shipped Tweaks and Fixes

Here are some other highlights from the 41 merged Kuma PRs in April:

  • PR 4144: In the page history, enable comparing the first translation to the English source (safwanrahman).
  • PR 4158: When creating a sub-page of a redirected page, create the sub-page at the new URL (safwanrahman).
  • PR 4171: Enable Bulgarian as a supported language. This was a small code change backed by a lot of work from kberov and jswisher.
  • PR 4173: Deprecate KumaScript editing on MDN, and reduce the static assets (stephaniehobson).
  • PR 4181: For page watch emails for a first translation, show a diff to the original English text, and prepare for more email changes (jwhitlock).
  • PR 4182: Banned user profiles are now 404 Not Found, not 403 Forbidden (safwanrahman).
  • PR 4190: The Insert Live Sample editor action inserts better section titles (sheppy).

There are some new contributors in the 16 merged KumaScript PRs in April:

Planned for May

Here’s what we’re planning to ship in May:

Experiment with On-site Interactive Examples

In April, we shipped a first iteration of changing examples on MDN. wbamberg and Elchi3 created alternate versions of JavaScript and CSS reference pages with short examples at the top of the page. We’ve added an A/B test to see if there is a behavioral difference for users seeing these examples.

In May, schalkneethling will refine wbamberg’s prototype code to add interactive examples for CSS and JavaScript. This will allow users to test their understanding by making changes to the short examples and seeing the results without leaving MDN. We’ll also ship this as an A/B test, and analyze the results before planning further rollouts.

More Legacy Cleanup and Fixes

We’re planning on reducing and removing more legacy features, to simplify the Kuma project and make room for new development:

  • Remove the Vagrant development environment.
  • Remove the Ansible provisioning system, used by Vagrant and TravisCI.
  • Rework Zones, moving the styles to standard assets and simplifying configuration.
  • Fix several bugs and misfeatures in client-side drafts.
  • Rewrite more tests in the py.test style.
  • Improve the KumaScript engine, macros, and development process.

Ship the Sample Database

The Sample Database has been promised every month since October 2016, and has slipped every month. We don’t want to break the tradition: the sample database will ship in May, for the anniversary of the project. See PR 4076 for the remaining tasks.

Kuma Report, March 2017

Here’s what happened in March in Kuma, the engine of MDN:

  • Shipped content experiments framework
  • Merged read-only maintenance mode
  • Shipped tweaks and fixes

Here’s the plan for April:

  • Clean up KumaScript macro development
  • Improve and maintain CSS quality
  • Ship the sample database

Done in March

Content Experiments Framework

We’re planning to experiment with small, interactive examples at the top of high-traffic reference pages. We want to see the effects of this change, by showing the new content to some of the users, and tracking their behavior. We shipped a new A/B testing framework, using the Traffic Cop library in the browser. We’ll use the framework for the examples experiment, starting in April.
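The core of any A/B framework is deciding which visitors see which variation. The sketch below uses deterministic hashing of a visitor identifier, which is a different technique from a browser-side library that picks a variation and remembers it on the client. It is not Traffic Cop's or Kuma's code, and the function name, the identifier, and the percentages are assumptions.

```python
import hashlib


def assign_variation(user_id, experiment, variations):
    """Assign a visitor to a variation, or None for the control group.

    variations is a list of (name, percent) pairs. Visitors whose bucket
    falls outside all listed percentages see the unchanged page.
    """
    total = sum(percent for _, percent in variations)
    if total > 100:
        raise ValueError("variation percentages exceed 100")
    # Hashing experiment and visitor together keeps assignments stable per
    # experiment while staying independent across experiments.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    upper = 0
    for name, percent in variations:
        upper += percent
        if bucket < upper:
            return name
    return None
```

For example, `assign_variation(visitor_id, "interactive-examples", [("with-example", 10)])` would show the new content to roughly a tenth of visitors and always give the same answer for the same visitor.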

Read-Only Maintenance Mode

We’ve merged a new maintenance mode configuration, which keeps Kuma running when the database connection is read-only. Eventually, this will allow MDN content to remain available when the database is being updated, and lead to new distributed architectures. In the near term, we’ll use it to test our new AWS infrastructure running production backups, and eventually against off-peak MDN traffic.

Shipped Tweaks and Fixes

Here are some other highlights from the 15 merged Kuma PRs in March:

KumaScript continues to be busy, with 19 merged PRs. There were some PRs from new contributors:

Planned for April

We had a productive work week in Toronto. We decided that we need to make sure we’re paying down our technical debt regularly, while we continue supporting improved features for MDN visitors. Here’s what we’re planning to ship in April:

Clean Up KumaScript Macro Development

KumaScript macros have moved to GitHub, but ghosts of the old way of doing things remain in Kuma, and the development process is still tricky. This month, we’ll tackle some of the known issues:

  • Remove the legacy macros from MDN (stuck in time at November 2016)
  • Remove macro editing from MDN
  • Update macro searching
  • Start on an automated testing framework for KumaScript macros

Improve and Maintain CSS Quality

We’re preparing for some future changes by getting our CSS in order. One of the strategies will be to define style rules for our CSS, and check that existing code is compliant with stylelint. We can then enforce the style rules by detecting violations in pull requests.

Ship the Sample Database

The Sample Database has been promised every month since October 2016, and has slipped every month. We don’t want to break the tradition: the sample database will ship in April. See PR 4076 for the remaining tasks.