Django, K8s, and ELB Health Checks

As you may have seen in several of our SRE status reports, we’re moving all of our webapp hosting from Deis to Kubernetes (k8s). As part of that move we’ve also been thinking harder about the security of our deployments. One thing we haven’t handled as well as we should is Django’s ALLOWED_HOSTS setting. We should have been listing every valid host there, but it seems we occasionally left it set to ['*']. This isn’t great, but it also isn’t the end of the world, since we don’t knowingly construct URLs using the info sent via the Host header. In an effort to cover all bases we’ve decided to improve this. Unfortunately, our particular combination of technologies doesn’t make this as easy as we thought it would be (story of our lives).

AWS ELB Health Checks

Here’s the thing: Amazon Web Services’ (AWS) Elastic Load Balancers (ELB) do not have many configuration options for their health checks. These checks ensure that your app on a particular node in your cluster is working as expected. If the check fails, the ELB will remove the node from the list of nodes to which it routes requests for your app. However, because the check hits the nodes directly, it doesn’t rely on DNS: it requests the IP address and port, and it doesn’t allow you to specify custom headers (e.g. the Host header). It also can’t use HTTPS in our setup, because we terminate TLS connections at the ELB, so the app nodes speak only plain HTTP back to the ELB. All of that means that our health check endpoint needs to do two things the rest of the app does not: allow plain HTTP connections, and accept the IP address that the ELB requests as a valid Host header. The first bit is easy enough when using Django’s built-in SecurityMiddleware, since it supports the SECURE_REDIRECT_EXEMPT setting. It’s the second requirement that gets more interesting when combined with k8s.
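To make the first requirement concrete, here is a minimal sketch of how the exemption behaves. The health check path (healthz/) is a made-up example rather than the path we actually use, and would_redirect() is a hypothetical stand-in that imitates SecurityMiddleware’s decision (it matches each pattern against the request path with the leading slash stripped) so that the snippet runs without Django installed.

```python
import re

# Django settings. The health check path is a hypothetical example.
SECURE_SSL_REDIRECT = True
SECURE_REDIRECT_EXEMPT = [r"^healthz/$"]  # regexes, matched without the leading "/"


def would_redirect(path, is_secure):
    """Imitate SecurityMiddleware's HTTP -> HTTPS redirect decision (simplified)."""
    stripped = path.lstrip("/")
    exempt = any(re.search(pattern, stripped) for pattern in SECURE_REDIRECT_EXEMPT)
    return SECURE_SSL_REDIRECT and not is_secure and not exempt
```

With these settings, a plain HTTP request from the ELB to /healthz/ is served as-is, while a plain HTTP request to any other path is still redirected to HTTPS.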

K8s Routing

The way I understand it (and I’m admittedly no expert) is that k8s (at least the way we use it) sets up a NodePort per app (or namespace). To hit that app you can hit any node in the cluster at that port and that node will route you to one of the nodes that is running a pod for that app. The important bit for us is that the node that serves this request is not necessarily the one that the ELB sent it to. So the Host header may contain an IP address for the node that was initially hit, but not necessarily for the node that serves the request. This means that we can’t simply add the IP of the host to the ALLOWED_HOSTS list when the app starts. We could get more info from AWS’ metadata service endpoint, but for security reasons we block that service from all of our nodes.

So, the approach could then be to simply add the IPs of all of the nodes in the cluster to the ALLOWED_HOSTS setting and call it done. The problem with this shows up when there is a scaling event. When a node is killed and a new one started, or the cluster is scaled to include more nodes, you’d need a way to inform every running pod of the change so they could get the new list of IPs. If they didn’t update the list, the new node(s) could be immediately excluded from the cluster: their IP (host) would not be allowed by Django, so health checks sent to them would return 400s.
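A toy illustration of that failure mode, with made-up IP addresses, hostname, and port. health_check_status() is a hypothetical helper that only does an exact match on the host part of the Host header, which is a simplification of what Django does.

```python
# Node IPs that were known when the pod started (hypothetical values).
ALLOWED_HOSTS = ["app.example.com", "10.0.1.10", "10.0.1.11"]


def health_check_status(host_header):
    """Return the status a health check would get for this Host header."""
    host = host_header.rsplit(":", 1)[0]  # drop the ":port" suffix (IPv4 only)
    return 200 if host in ALLOWED_HOSTS else 400


# A node that existed at startup passes the check.
old_node = health_check_status("10.0.1.10:30080")
# A node added by a scaling event is unknown to the running pod, so it fails.
new_node = health_check_status("10.0.1.12:30080")
```

Here old_node is 200 and new_node is 400, even though the new node is perfectly healthy.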

Enter django-allow-cidr

The way we decided to solve this was by implementing a Django middleware that allows a range of IP addresses defined in CIDR (Classless Inter-Domain Routing) notation. We’ve released this middleware in a Django package called django-allow-cidr. It works by storing the normal hosts you’ve set in your ALLOWED_HOSTS setting, changing that setting to ['*'] in order to bypass Django’s default Host header checking in the HttpRequest.get_host() method, and then doing the checking itself. It does this checking via the same methods Django would have used, but if those methods fail it does a secondary check using the IP ranges you’ve defined in an ALLOWED_CIDR_NETS setting. It creates netaddr.IPNetwork instances from the CIDRs in that list, and any host that isn’t valid according to your original ALLOWED_HOSTS setting is checked against them. A host that fails both checks gets an immediate 400 response.
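The two-stage check can be sketched roughly as follows. This is not the package’s actual code: it uses Python’s standard ipaddress module in place of netaddr, does only exact matching on ALLOWED_HOSTS (Django also supports patterns such as a leading dot), ignores IPv6, and the hostname, subnet, and function names are all assumptions made for illustration.

```python
import ipaddress

ALLOWED_HOSTS = ["www.example.com"]   # hypothetical hostname
ALLOWED_CIDR_NETS = ["10.0.0.0/16"]   # hypothetical cluster subnet

# Parse the CIDRs once at startup rather than on every request.
_networks = [ipaddress.ip_network(cidr) for cidr in ALLOWED_CIDR_NETS]


def strip_port(host):
    """Drop a trailing ":port" from a hostname or IPv4 address."""
    return host.rsplit(":", 1)[0] if ":" in host else host


def host_is_allowed(raw_host):
    host = strip_port(raw_host).lower()
    # First check: the ordinary host list.
    if host in ALLOWED_HOSTS:
        return True
    # Second check: is the host an IP address inside an allowed range?
    try:
        address = ipaddress.ip_address(host)
    except ValueError:
        return False  # not an IP address, so no CIDR can match
    return any(address in network for network in _networks)


def status_for(raw_host):
    """Status code the check would produce for this Host header."""
    return 200 if host_is_allowed(raw_host) else 400
```

Because the range covers every address a node in the subnet could have, a node added by a scaling event passes the health check without any running pod needing to be told about it.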

Conclusion

That was a long way to go to get to some simple health checking, but we believe it was the right move for the reliability and security of our Django apps hosted in our k8s infrastructure on AWS. Please check out the repo for django-allow-cidr on GitHub if you’re interested in the code. Our hope is that releasing this as a general-use package will help others who find themselves in our situation, as well as helping us do less copypasta coding around our various web projects.