Here’s what happened on the MozMEAO SRE team from May 9th - May 16th.
Current work
Bedrock (mozilla.org)
Work continues on moving Bedrock to our Kubernetes infrastructure.
Postgres/RDS provisioning
A Postgres RDS instance has already been provisioned in us-east-1 for our Virginia cluster, and another was created in ap-northeast-1 to support the Tokyo cluster. Additionally, development, staging, and production databases were created in each region. This process was documented here.
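The layout described above is one RDS instance per region, each holding three databases. As a rough illustration only, the region and environment matrix can be enumerated as follows; the `bedrock_<env>` naming scheme and the `database_plan` helper are hypothetical and not taken from the team's automation.

```python
from itertools import product

# Regions come from the text above; the cluster labels are for readability.
REGIONS = {"us-east-1": "virginia", "ap-northeast-1": "tokyo"}
ENVIRONMENTS = ["dev", "stage", "prod"]


def database_plan(app="bedrock"):
    """Return one (region, database name) pair per region and environment.

    The naming scheme is hypothetical and only illustrates the layout.
    """
    return [
        (region, f"{app}_{env}")
        for region, env in product(REGIONS, ENVIRONMENTS)
    ]


for region, name in database_plan():
    print(f"{region} ({REGIONS[region]}): {name}")
```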
Elastic Load Balancer (ELB) provisioning
We’ve automated the creation of ELBs for Bedrock in Virginia and Tokyo. There are still a few more wrinkles to sort out, but the infra is mostly in place to begin The Big Move to Kubernetes.
MDN
Work continues to analyze the Apache httpd configuration from the current SCL3 datacenter config.
On May 13th, 2017, from 22:49 to 22:55, New Relic reported that MDN was unavailable. The site was slow to respond to page views, and was running long database queries. Log analysis showed a security scan of our database-intensive endpoints.
On May 14th, 2017, there were high I/O alerts on 3 of the 6 production web servers. This was not reflected in high traffic or a decrease in responsiveness.
Basket
The FxA team would like to send events (FXA_IDs) to Basket and Salesforce, and needed SQS queues in order to move forward. We automated the provisioning of dev/stage/prod SQS queues, and passed off credentials to the appropriate engineers.
A planned configuration change to add a Route 53 Traffic Policy for the snippets stats collection service caused a day’s worth of data to not be collected due to an SSL certificate error.
Careers
Autoscaling
In order to take advantage of Kubernetes cluster and pod autoscaling (which we’ve documented here), app memory and CPU limits were set for careers.mozilla.org in our Virginia and Tokyo clusters. This allows the careers site to scale up and down based on load.
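Pod autoscaling in Kubernetes compares observed CPU use against a target, which is why the app needs resource settings before it can scale. The sketch below shows the proportional rule the Horizontal Pod Autoscaler documents (desired = ceil(current replicas × current metric / target metric)); the function name, the replica bounds, and the sample numbers are assumptions for illustration, not the careers site's configuration.

```python
import math


def desired_replicas(current_replicas, current_cpu_pct, target_cpu_pct,
                     min_replicas=1, max_replicas=10):
    """Scale the replica count in proportion to observed vs. target CPU use.

    CPU use is a percentage of what each pod has been allotted, so the
    calculation only works once the app declares its CPU resources.
    """
    raw = math.ceil(current_replicas * current_cpu_pct / target_cpu_pct)
    # Clamp to the configured bounds (the defaults here are illustrative).
    return max(min_replicas, min(max_replicas, raw))


print(desired_replicas(2, 90, 60))  # load above target: scale up
print(desired_replicas(4, 15, 60))  # load below target: scale down
```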
Acceptance tests
Giorgos Logiotatidis added acceptance tests, consisting of a simple bash script and additional Jenkinsfile stages, to check that careers.mozilla.org pages return valid responses after deployment.
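The actual tests are a bash script run from Jenkins. The Python sketch below only illustrates the same idea of a post-deployment smoke check; `fetch_status` and `failing_pages` are hypothetical helpers, and treating only HTTP 200 as valid is an assumption.

```python
from urllib.error import HTTPError
from urllib.request import urlopen


def fetch_status(url, timeout=10):
    """Return the HTTP status code for url, or None if it is unreachable."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.status
    except HTTPError as error:
        # 4xx/5xx responses still carry a status code worth reporting.
        return error.code
    except OSError:
        # DNS failures, refused connections, and timeouts end up here.
        return None


def failing_pages(base_url, paths, fetch=fetch_status):
    """Return the paths whose response status is not 200."""
    return [path for path in paths if fetch(base_url + path) != 200]
```

A pipeline stage could call `failing_pages("https://careers.mozilla.org", ["/"])` and fail the build if the returned list is not empty; the `fetch` parameter exists so the check can be exercised without network access.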
Downtime incident 2017-04-11
A typo was merged and pushed to production, causing a couple of minutes of downtime before we rolled back to the previous version.
Decommission openwebdevice.org status
openwebdevice.org will remain operational in http-only mode until the board approves decommissioning. A timeline is unavailable.
Future work
Nucleus
We’re planning to move nucleus to Kubernetes, and then proceed to decommissioning current nucleus infra.
Basket
We’re planning to move basket to Kubernetes shortly after the nucleus migration, and then proceed to decommissioning existing infra.
Here’s what happened on the MozMEAO SRE team from May 3rd - May 9th.
Current work
Bedrock (mozilla.org)
Bedrock multi-region RDS provisioning
Work continues to move Bedrock from Deis 1/Fleet to Kubernetes. The team has implemented Terraform automation to provision RDS instances in multiple regions.
Demo deployments
Jenkins deployments have been restructured, and demos now build in the main pipeline. This was a meaty PR from pmac, and a motivation to upgrade Deis Workflow to the latest version (more info below).
Next actions:
create persistent development, staging, and production applications using RDS (Postgres)
enable deployments to new apps in Jenkins
CloudFront distribution and integration testing
MDN
We’re working on migrating the custom Apache config for MDN directly into Kuma/Django for the eventual move to AWS. Most of the Apache rewrites/redirects have been implemented in Kuma, with only a few remaining.
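Moving a rewrite rule out of Apache and into the application roughly means expressing it as a pattern plus a target that the app resolves itself. The sketch below is not Kuma's implementation; the two rules and the `resolve_redirect` helper are hypothetical stand-ins for Apache RewriteRule entries.

```python
import re

# Hypothetical rules, each standing in for an Apache RewriteRule:
# (compiled pattern, replacement template, permanent?)
REDIRECTS = [
    (re.compile(r"^/en/docs/(?P<slug>.+)$"), r"/en-US/docs/\g<slug>", True),
    (re.compile(r"^/learn/?$"), "/en-US/docs/Learn", False),
]


def resolve_redirect(path):
    """Return (status, location) for the first matching rule, else None."""
    for pattern, template, permanent in REDIRECTS:
        match = pattern.match(path)
        if match:
            # expand() substitutes named groups into the target template.
            return (301 if permanent else 302, match.expand(template))
    return None


print(resolve_redirect("/en/docs/Web/CSS"))
```

Keeping the rules in a table like this also makes them easy to cover with unit tests, which is harder to do for web server configuration.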
Basket
The FxA team would like to send events (FXA_IDs) to Basket and Salesforce, and need SQS queues in order to move forward. We automated the provisioning of dev/stage/prod SQS queues, and passed off credentials to the appropriate engineers.
snippets-stats was already running on our Deis 1 clusters in Oregon and Ireland; Giorgos has now enabled it on our Virginia and Tokyo Kubernetes clusters as well.
Metrics have been validated for snippets-stats in Virginia and Tokyo.
Issues with the HTTP_X_FORWARDED_PROTO header not being set for snippets-*.virginia.moz.works
We created a generic http to https redirector service that runs in Kubernetes. This allows Kubernetes to handle forwarding http to https for us without having custom implementations in each application. However, there is still an issue in our current ELB setup where HTTP_X_FORWARDED_PROTO is not set, so Django cannot tell whether a connection is secure or not.
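To make the two halves of this concrete, here is a minimal sketch: a WSGI app that redirects any request to the same URL over https, and the check an app behind a load balancer has to make to learn the original scheme. This is not the team's redirector service; `redirector` and `is_secure` are hypothetical names.

```python
def redirector(environ, start_response):
    """WSGI app: send every request to the same URL over HTTPS."""
    host = environ.get("HTTP_HOST", "localhost")
    path = environ.get("PATH_INFO", "/")
    query = environ.get("QUERY_STRING", "")
    location = f"https://{host}{path}" + (f"?{query}" if query else "")
    start_response(
        "301 Moved Permanently",
        [("Location", location), ("Content-Length", "0")],
    )
    return [b""]


def is_secure(environ):
    """Decide whether the original client connection used HTTPS.

    When the load balancer terminates TLS, the app only sees plain HTTP and
    has to rely on the X-Forwarded-Proto header (HTTP_X_FORWARDED_PROTO in
    WSGI). If the header is missing, every request looks insecure.
    """
    return environ.get("HTTP_X_FORWARDED_PROTO", "").lower() == "https"
```

Django's `SECURE_PROXY_SSL_HEADER` setting performs the equivalent of `is_secure`, which is why a missing header leaves Django unable to recognize secure requests.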
Future work
Nucleus
We’re planning to move nucleus to Kubernetes, and then proceed to decommissioning current nucleus infra.
Basket
We’re planning to move basket to Kubernetes shortly after the nucleus migration, and then proceed to decommissioning existing infra.
New Kubernetes cluster
We’ll be creating a new Kubernetes cluster in Portland so we can take advantage of EFS to support MDN in that region. We currently run many of our services from Portland, Virginia, and Ireland. The new cluster will be created in an entirely new VPC, and existing resources will not be shared.
Here’s what happened on the MozMEAO SRE team from April 25th - May 2nd.
Current work
Bedrock (mozilla.org)
Bedrock CDN
The SRE team is currently evaluating different CDN options for Bedrock. The CDN that we choose needs to have support for the Accept-Language header, which CloudFront and Fastly both appear to provide. Next up is testing CloudFront with a bedrock demo deployment.
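The header matters because a site that picks a locale from Accept-Language can return different responses for the same URL, so a CDN has to forward the header or vary its cache on it. The sketch below shows the kind of negotiation involved; it is not Bedrock's logic, and `parse_accept_language` and `choose_locale` are hypothetical helpers.

```python
def parse_accept_language(header):
    """Return language tags from an Accept-Language header, best first."""
    weighted = []
    for position, item in enumerate(header.split(",")):
        tag, _, params = item.strip().partition(";")
        tag = tag.strip().lower()
        if not tag:
            continue
        quality = 1.0
        params = params.strip()
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0
        if quality > 0:  # q=0 means "not acceptable"
            weighted.append((-quality, position, tag))
    return [tag for _, _, tag in sorted(weighted)]


def choose_locale(header, supported, default="en-US"):
    """Pick the first supported locale that matches the header."""
    by_lower = {locale.lower(): locale for locale in supported}
    for tag in parse_accept_language(header):
        if tag in by_lower:
            return by_lower[tag]
        # Fall back from a regional tag such as "fr-ch" to any "fr" locale.
        prefix = tag.split("-")[0]
        for lower, locale in by_lower.items():
            if lower.split("-")[0] == prefix:
                return locale
    return default


print(choose_locale("fr-CH, fr;q=0.9, en;q=0.8", ["en-US", "fr", "de"]))
```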
Bedrock moving to Kubernetes
Our Fleet and Deis 1 infrastructure will eventually be replaced with Kubernetes and Deis Workflow. pmac and jgmize have bedrock deployed in our Virginia Kubernetes cluster. Minor issues with https redirects were uncovered, but they have been resolved. Next steps are getting integration tests working and trying CloudFront with this deployment.
Bedrock log analysis
The bedrock durable team is looking to gather some traffic metrics for /firefox, and using AWS Athena to query the data in the S3 bucket populated by Papertrail looks like a viable solution.
Old resources in SCL3
pmac is going to follow up on moving old SCL3 Bedrock resources to an S3 bucket for backup.
MDN
The SRE team has been working on the analysis of the existing SCL3 MDN deployment and its migration to AWS.
Below are some issues and PRs related to this work:
Refactor existing S3 automation into a shared directory
region-specific resources (EFS) are easily created in any region w/ ~5 lines of env var + bash. Currently provisioned for our Virginia K8s cluster (us-east-1).
Giorgos is working on an unofficial New Relic Synthetics CLI tool:
“NeReS is a cli tool to manage NewRelic Synthetics monitors with a Synthetics Lite account (Pro should work too). The tool emulates the actions of a user in the browser and doesn’t use the Synthetics API since that’s only available to the Pro accounts.”
We’ll be decommissioning webwewant.mozilla.org. A webops bug has been filed to redirect webwewant requests to mozilla.org.
Decommission openwebdevice.org?
We’re looking into the possibility of shutting down this site, and are waiting for some internal communications before moving forward.
Nucleus
We’re planning to move nucleus to Kubernetes, and then proceed to decommissioning current nucleus infra.
Basket
We’re planning to move basket to Kubernetes shortly after the nucleus migration, and then proceed to decommissioning existing infra.
Snippets
Status unchanged since last week. Giorgos is looking at snippets-stats to see if it’s behaving correctly. The snippets-stats Route53 routing policy currently points at the Deis 1 deployment due to low stats alerts in the new K8s environment.
New Kubernetes cluster
We’ll be creating a new Kubernetes cluster in Portland so we can take advantage of EFS to support MDN in that region. We currently run many of our services from Portland, Virginia, and Ireland. The new cluster will be created in an entirely new VPC, and existing resources will not be shared.
Here’s what happened in April in Kuma, the engine of MDN:
Explored faster paths to AWS
Maintained quality with new robots
Improved the macros dashboard
Goodbye BrowserCompat API, Hello Browser Compat Data
Shipped tweaks and fixes
Here’s the plan for May:
Experiment with on-site interactive examples
More legacy cleanup and fixes
Ship the sample database
Done in April
Explored Faster Paths to AWS
The AWS Migration Plan details how Kuma and its backing services will need to
evolve to fit into a cloud architecture. However, there are good reasons to make
the switch quickly with a non-ideal architecture. We can minimize the painful
transition where we are supporting both the current datacenter and AWS. Some
changes are difficult to do now, and could be easier in the single AWS
environment.
Maintained Quality with New Robots
Linters are pedantic robots that find syntax errors, style mistakes, and
language misuse. They can be jerks about the small stuff so that code reviewers
can focus on the ideas in the new code. We added several new linters to our
development process:
stylelint checks Kuma’s stylesheets, which required some work to get into the proper format. (PR 4167, PR 4170, and PR 4170)
ESLint can be used to check Kuma’s JavaScript. Our JS needs work. (PR 4199)
EJSLint checks for invalid EJS templates in KumaScript PRs. (PR 154)
JSON Lint checks for invalid JSON in KumaScript PRs. (PR 159)
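As an illustration of the last kind of check, the sketch below validates a JSON document with Python's standard `json` module. It is not the JSON Lint tool used in the KumaScript PRs; `lint_json` is a hypothetical helper that shows what "invalid JSON" means in practice.

```python
import json


def lint_json(text):
    """Return None if text is valid JSON, else a short error description."""
    try:
        json.loads(text)
    except json.JSONDecodeError as error:
        return f"line {error.lineno}, column {error.colno}: {error.msg}"
    return None


print(lint_json('{"name": "macro"}'))   # valid document
print(lint_json('{"name": "macro",}'))  # trailing comma is rejected
```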
Improved the Macros Dashboard
We’ve shipped an improved macros dashboard, which lets KumaScript authors see
how often macros are used, access the macro source, and find documents that use
the macro. There are over 90 macros not used on any page, so there are
opportunities for deprecating and removing macros.
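The usage counts behind such a dashboard can be approximated by scanning page content for macro calls, which are written in double curly braces. The sketch below is not the dashboard's implementation; the regular expression, the helper names, and the case-insensitive matching of macro names are assumptions.

```python
import re
from collections import Counter

# Matches the macro name in calls such as {{Compat}} or {{ cssxref("color") }}.
MACRO_CALL = re.compile(r"\{\{\s*([A-Za-z_][\w.-]*)")


def macro_usage(documents):
    """Count, per macro, how many documents call it (names lowercased)."""
    usage = Counter()
    for content in documents.values():
        # A set, so a macro used twice on one page counts as one document.
        usage.update({name.lower() for name in MACRO_CALL.findall(content)})
    return usage


def unused_macros(known_macros, documents):
    """Return the known macros that no document calls."""
    usage = macro_usage(documents)
    return sorted(name for name in known_macros if usage[name.lower()] == 0)


docs = {
    "page-a": 'Intro {{Compat}} and {{ cssxref("color") }}',
    "page-b": "{{compat}}",
}
print(unused_macros(["Compat", "CSSRef", "cssxref"], docs))
```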
Goodbye BrowserCompat API, Hello Browser Compat Data
BrowserCompat was a 2014
project to build an API to serve browser compatibility data, for MDN and for
other users. An API was a development-heavy solution, and we had to abandon it
in 2016 when we lost resources. This month,
stephaniehobson removed the MDN assets
supporting this project, so we can start shutting down the service.
For the next iteration of this idea, we’re hand-coding JSON structures with
the Browser Compatibility data, and working to make MDN the first consumer of
this data. You can follow the project on the
browser-compat-data repository.
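To give a feel for what consuming hand-written JSON looks like compared with calling an API, here is a small sketch. The structure, the feature name, and the version values are invented for illustration and are not the schema or data of the browser-compat-data repository.

```python
import json

# A made-up structure for illustration only; it is not the schema used by
# the browser-compat-data repository, and the versions are placeholders.
COMPAT_JSON = """
{
  "css.properties.example-property": {
    "support": {
      "firefox": {"version_added": "50"},
      "chrome": {"version_added": null}
    }
  }
}
"""


def supported_since(data, feature, browser):
    """Return the version that added support, or None if unknown."""
    entry = data.get(feature, {}).get("support", {}).get(browser, {})
    return entry.get("version_added")


data = json.loads(COMPAT_JSON)
print(supported_since(data, "css.properties.example-property", "firefox"))
```

With plain files, a consumer such as MDN only needs a JSON parser and a lookup, rather than a running service.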
Experiment with On-site Interactive Examples
In April, we shipped a first iteration of changing examples on MDN.
wbamberg and Elchi3 created alternate versions of JavaScript and CSS reference
pages with short examples at the top of the page. We’ve added an A/B test to see
if there is a behavioral difference for users seeing these examples.
In May, schalkneethling will refine wbamberg’s prototype code to add interactive
examples for CSS and JavaScript. This will allow users to test their
understanding by making changes to the short examples and seeing the results
without leaving MDN. We’ll also ship this as an A/B test, and analyze the
results before planning further rollouts.
More Legacy Cleanup and Fixes
We’re planning on reducing and removing more legacy features, to simplify the
Kuma project and make room for new development:
Remove the Vagrant development environment.
Remove the Ansible provisioning system, used by Vagrant and TravisCI.
Rework Zones, moving the styles to standard assets and simplifying
configuration.
Fix several bugs and misfeatures in client-side drafts.
Improve the KumaScript engine, macros, and development process.
Ship the Sample Database
The Sample Database has been promised every month since October 2016, and
has slipped every month. We don’t want to break the tradition: the
sample database will ship in May, for the anniversary of the project. See
PR 4076 for the remaining tasks.
Here’s what happened in March in Kuma, the engine of MDN:
Shipped content experiments framework
Merged read-only maintenance mode
Shipped tweaks and fixes
Here’s the plan for April:
Clean up KumaScript macro development
Improve and maintain CSS quality
Ship the sample database
Done in March
Content Experiments Framework
We’re planning to experiment with small, interactive examples at the top of
high-traffic reference pages. We want to see the effects of this change,
by showing the new content to some of the users, and tracking their
behavior. We shipped a new A/B testing framework, using the
Traffic Cop library in the browser.
We’ll use the framework for the examples experiment, starting in April.
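The core of any such framework is deciding which visitors see which variation. The sketch below illustrates one way to do that, by hashing a visitor identifier into a percentage bucket so the same visitor always gets the same answer. It is not how Traffic Cop is implemented; the function, its parameters, and the experiment names are assumptions.

```python
import hashlib


def assign_variation(visitor_id, experiment, variations):
    """Map a visitor to a variation name, or None for everyone else.

    variations maps a variation name to the percentage of visitors who
    should see it; the percentages may sum to less than 100.
    """
    key = f"{experiment}:{visitor_id}".encode()
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % 100  # 0 to 99
    upper = 0
    for name, percent in variations.items():
        upper += percent
        if bucket < upper:
            return name
    return None


# Hypothetical experiment: 10% see the new examples, 10% act as the control.
split = {"control": 10, "interactive-examples": 10}
print(assign_variation("visitor-123", "examples-test", split))
```

Tracking then records the assigned variation alongside each visitor's behavior, so the two groups can be compared.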
Read-Only Maintenance Mode
We’ve merged a new maintenance mode configuration, which keeps Kuma running
when the database connection is read-only. Eventually, this will allow MDN
content to remain available when the database is being updated, and lead
to new distributed architectures. In the near term, we’ll use it to
test our new AWS infrastructure running production backups, and eventually
against off-peak MDN traffic.
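In practice, a read-only mode means letting through requests that only read and turning away those that would write. The sketch below is not Kuma's implementation; it assumes that GET, HEAD, and OPTIONS requests never write to the database, and `maintenance_response` is a hypothetical helper.

```python
SAFE_METHODS = {"GET", "HEAD", "OPTIONS"}


def maintenance_response(method, maintenance_mode):
    """Return a (status, message) pair to short-circuit a request, or None.

    In maintenance mode, only methods assumed not to write to the database
    are allowed through to the normal view code.
    """
    if maintenance_mode and method.upper() not in SAFE_METHODS:
        return (503, "The site is in read-only mode; please try again later.")
    return None


print(maintenance_response("GET", maintenance_mode=True))   # served normally
print(maintenance_response("POST", maintenance_mode=True))  # rejected
```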
We had a productive work week in Toronto. We decided that we need to make sure
we’re paying down our technical debt regularly, while we continue supporting
improved features for MDN visitors. Here’s what we’re planning to ship in April:
Clean Up KumaScript Macro Development
KumaScript macros have moved to GitHub, but ghosts of the old way of doing
things remain in Kuma, and the development process is still tricky. This month,
we’ll tackle some of the known issues:
Remove the legacy macros from MDN (stuck in time at November 2016)
Remove macro editing from MDN
Update macro searching
Start on an automated testing framework for KumaScript macros
Improve and Maintain CSS Quality
We’re preparing for some future changes by getting our CSS in order. One of the
strategies will be to define style rules for our CSS, and check that existing
code is compliant with stylelint. We can then enforce
the style rules by detecting violations in pull requests.
Ship the Sample Database
The Sample Database has been promised every month since October 2016, and
has slipped every month. We don’t want to break the tradition: the
sample database will ship in April. See PR 4076 for the remaining tasks.