At Thought Machine we’re building a cloud-native core banking platform called Vault. Two of the main requirements we have are:
- being able to handle millions of users and thousands of transactions per second, and
- being able to deploy to a variety of cloud providers and custom data centres.
On top of that, we want to be able to deploy our code as quickly as possible while still being able to roll back if anything goes wrong.
In this article we will describe three main components of our infrastructure: the CI/CD pipeline that enables development at high velocity; the underlying OS and network stack supporting Thought Machine Vault; and the tools available to developers for monitoring their services.
Up until November 2016 we had been deploying microservices to AWS EC2 instances using CloudFormation and autoscaling groups. We developed thin wrappers around Troposphere and allowed developers to set up their own load balancers, security groups, launch configurations and more. Because our infrastructure was defined as code, we could run automated tests before deploying, put changes through the same code review process as everything else, and organise our infrastructure into modules to avoid repetition. At runtime, small Puppet modules would then download the required artifacts onto the machines and configure them to run.
This worked well for a while but did have some downsides.
- Since we were using CloudFormation, we were completely tied into the AWS ecosystem. Our microservices could not be deployed to other cloud providers without a significant rewrite.
- VMs would take a long time to start up (around 5 minutes) when rolling out updates to our microservices.
- We were unable to clearly determine the exact cause when a microservice failed. Was the failure caused by changes in a security group or was there simply a bug in the microservice?
- Our Puppet modules lived in a separate repo, adding an extra barrier for developers who needed to update the runtime configuration of their microservices.
It was time to look into something new. Some of the properties we were looking for were:
- Empower engineers to be able to truly own their deployments and rollouts. The infrastructure team should not be a gatekeeper for day-to-day activities (e.g. rolling out a new application version or even adding an entirely new microservice). Product teams should be able to go from prototype to production with no involvement from the infrastructure team.
- Clear separation between microservice deployments and infrastructure changes. This would allow us to easily understand who was responsible in case of failure.
- Fine-grained authentication and authorisation, application-level access control and removing the need for developers to access cloud provider consoles, APIs or UIs directly.
- Storing deployment configuration next to application code to avoid having to jump around multiple repos when making changes.
- Fast startup times when rolling out updates to our microservices.
- Removing any need to SSH into VMs to debug failures. Logs and additional useful information should be readily available to all engineers when something goes wrong.
Kubernetes allowed us to clearly define responsibilities between the product and ops teams. It provided the primitives necessary - such as Services, Deployments and ConfigMaps - to roll our software out in a fast, predictable manner. The product teams could manage their own deployments while the ops teams offered the monitoring services needed to ensure that their deployments went smoothly.
All code at Thought Machine lives in a single monorepo, where we maintain code in multiple languages (mainly Python, Go and Java), as well as our Kubernetes resource files and infrastructure code. This allows the infrastructure team to make sweeping changes across the board in an atomic way. In the past, we maintained a separate repository for Puppet modules, but coordinating changes and deployments between multiple repos was difficult and error prone.
We have a strict code review policy where any change must be reviewed by at least one person (including a senior developer depending on the sensitivity of the code), a policy we’ve enforced from the beginning. It’s proven to be an amazing way to educate others, discuss designs or simply notify other engineers of what’s going on. As soon as a PR is raised, your changes are picked up by a build agent and all of the affected tests are run in the background (we use specialised on-premise servers for this).
Once your branch has been merged into master, we determine the Docker images that have been affected, re-build them and tag them with a unique hash based on the contents of the image. We then move on to template our Kubernetes object definitions (explained below).
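To illustrate what a tag "based on the contents of the image" can mean, here is a minimal Python sketch. It is not our actual build rule: the function name `content_tag`, the choice of SHA-256 and the 12-character truncation are all assumptions made for the example.

```python
import hashlib
from typing import Mapping


def content_tag(files: Mapping[str, bytes], length: int = 12) -> str:
    """Return a deterministic tag derived from file paths and contents.

    Hypothetical helper: the real build computes its hash differently.
    """
    digest = hashlib.sha256()
    # Sort paths so the tag does not depend on the order files are listed in.
    for path in sorted(files):
        encoded_path = path.encode("utf-8")
        data = files[path]
        # Length-prefix each field so that different (path, content)
        # combinations cannot produce the same byte stream.
        digest.update(len(encoded_path).to_bytes(8, "big"))
        digest.update(encoded_path)
        digest.update(len(data).to_bytes(8, "big"))
        digest.update(data)
    return digest.hexdigest()[:length]
```

The useful property is that identical inputs always give the same tag, while any change to a file gives a new one, so an unchanged image never needs to be rolled out again.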
Since we treat Docker image tags as immutable, updating our Deployment YAML files by hand every time an image changed would be painful. Instead, we template these values in at the end of the build. An example of a committed deployment might look like:
NB: The image tags prefixed with // refer to specific build targets (built using plz) within our repo. Each target has associated metadata (such as image name and sources), which is specified in the BUILD file for that directory.
Once the build is complete and all tests have passed, we know exactly what those image tags are and can template them in. The resulting spec might look something like this:
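As a rough sketch of this templating step, the Python below swaps each `image:` reference that starts with `//` for a fully resolved image name. The target label `//backend/api:image`, the registry name and the function `template_images` are all hypothetical; the real pipeline reads this information from build metadata.

```python
import re

# Matches an "image:" line whose value is a build-target reference (starts with //).
IMAGE_LINE = re.compile(r"^([ \t]*(?:-[ \t]*)?image:[ \t]*)(//\S+)[ \t]*$", re.MULTILINE)


def template_images(manifest: str, resolved: dict) -> str:
    """Replace build-target references with the image tags produced by the build."""

    def substitute(match):
        prefix, target = match.group(1), match.group(2)
        if target not in resolved:
            # Fail loudly rather than roll out a manifest with an unresolved target.
            raise KeyError(f"no built image for target {target}")
        return prefix + resolved[target]

    return IMAGE_LINE.sub(substitute, manifest)


committed = """\
spec:
  containers:
  - name: api
    image: //backend/api:image
"""

# Hypothetical mapping from build target to the immutable image it produced.
built = {"//backend/api:image": "registry.example.com/api:3f2a9c1b7d10"}

print(template_images(committed, built))
```

Failing on an unknown target is a deliberate choice in this sketch: a manifest that still contains a `//` reference can never be pulled by the cluster, so it is better to stop the build.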
These Kubernetes objects are then automatically rolled out to our initial developer environments.
OS and Network Stack
We have currently deployed our Kubernetes clusters to AWS, though we are actively experimenting with other cloud providers such as GCP. CoreOS Container Linux (the OS that all of our EC2 instances run) allows us to be confident that the latest security patches are applied as they come out (patches are applied automatically and VMs rebooted without human intervention). Some basic configuration is applied to the VM at startup and the Kubelet is run as a systemd unit. Everything else runs on top of Kubernetes.
Flannel is responsible for automatically assigning a unique IP address to each pod as it spins up. When a new worker node is added to the cluster, Flannel will reserve a chunk of the cluster-wide podnet and periodically renew its lease to avoid conflicts with other nodes. These pod IPs are viewable when querying for pods with kubectl:
    $ kubectl get pods -o wide
    NAME          READY   STATUS    RESTARTS   AGE   IP             NODE
    nginx-5vrpq   2/2     Running   0          11d   10.XXX.14.3    ip-10-XXX-72-162.eu-west-1.compute.internal
    nginx-7j72k   2/2     Running   0          12d   10.XXX.12.2    ip-10-XXX-150-131.eu-west-1.compute.internal
    nginx-q2f5j   2/2     Running   0          12d   10.XXX.11.19   ip-10-XXX-1-39.eu-west-1.compute.internal
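The idea of each node reserving its own chunk of the podnet can be sketched in a few lines of Python. This is only a toy model, not how Flannel is implemented: the `SubnetAllocator` class, the `10.0.0.0/16` pod network and the `/24` per-node size are assumptions, and lease renewal and expiry are left out entirely.

```python
import ipaddress


class SubnetAllocator:
    """Toy model: hand each node a distinct subnet of the pod network."""

    def __init__(self, pod_network: str, node_prefix: int = 24):
        # Lazily iterate over every /node_prefix block inside the pod network.
        self._free = ipaddress.ip_network(pod_network).subnets(new_prefix=node_prefix)
        self._leases = {}

    def lease(self, node: str):
        """Return the node's subnet, reserving a new one on first request."""
        if node not in self._leases:
            try:
                self._leases[node] = next(self._free)
            except StopIteration:
                raise RuntimeError("pod network exhausted") from None
        return self._leases[node]


allocator = SubnetAllocator("10.0.0.0/16")
for node in ("node-a", "node-b", "node-a"):
    print(node, allocator.lease(node))
```

Because every node draws from a disjoint block, pods on different nodes can never be given the same IP, which is what makes the addresses in the listing above unique across the cluster.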
By default, all pods on a cluster can talk to each other. This raises some security concerns; for example, we might want to limit public egress network traffic of all pods (except for edge pods or web servers), or isolate highly sensitive pods from the rest of the cluster. We can solve this problem by using Network Policies. Calico runs in parallel with Flannel, and allows us to specify network policies to limit network traffic between different pods. These network policies are based on high-level constructs such as pod labels, which means that product teams can write their own policies using the same language they use to write deployment files. Calico runs as a privileged container, listening for changes in network policies and rewriting iptables rules to enforce them (in practice, we’ve seen updates being applied very quickly, on the order of 1 or 2 seconds).
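To show how a label-based policy is put together, the Python sketch below builds a NetworkPolicy object for the `networking.k8s.io/v1` API and prints it as JSON (which kubectl accepts as well as YAML). The helper `allow_ingress_from` and the labels `app: db` and `app: api` are hypothetical and not taken from our cluster.

```python
import json


def allow_ingress_from(name: str, target_labels: dict, source_labels: dict) -> dict:
    """Build a policy letting only pods with source_labels reach pods with target_labels."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": name},
        "spec": {
            # Pods this policy applies to.
            "podSelector": {"matchLabels": target_labels},
            # Once a pod is selected for Ingress, traffic not listed below is dropped.
            "policyTypes": ["Ingress"],
            "ingress": [
                {"from": [{"podSelector": {"matchLabels": source_labels}}]}
            ],
        },
    }


policy = allow_ingress_from("db-from-api", {"app": "db"}, {"app": "api"})
print(json.dumps(policy, indent=2))
```

Since the selectors are ordinary pod labels, the policy can sit next to the Deployment that defines those labels and be reviewed in the same change.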
Even though product teams fully own their deployments and rollouts, the infrastructure team provides a series of cluster-wide services that each deployment will get for free when running in the cluster. These include DNS resolution, monitoring and alerting, logging, secret management and an in-house release management tool.
Monitoring and Alerting
To collect metrics, we use Prometheus, an open-source tool originally developed at SoundCloud. It uses a pull model to scrape metrics from many endpoints (normally specific pods in the cluster), and lets you store and query the resulting time series data (generally through a frontend like Grafana). Our cluster-level monitoring system provides basic metrics for all pods, and developers are encouraged to expose custom business-logic metrics, which are scraped automatically as well. To provide the basic metrics, we deploy the node-exporter DaemonSet, which runs on every node and exposes metrics for each pod on that node: CPU and memory usage, disk and network I/O, etc. Alerts (using the Prometheus Alertmanager) are triggered once specific conditions have been met (e.g. if the number of replicas backing a given service goes below a certain threshold or if the latency for a given SQL query is too high).
We quickly reached the limits of a single-instance Prometheus setup and have since federated metrics across multiple instances. This has allowed us to scale to many thousands of metrics, capturing everything from CPU usage to individual SQL query latencies or inter-service RPCs.
Similar to our monitoring setup, we deploy a Fluentd DaemonSet which is responsible for collecting logs for all pods running on a given node. Log messages are batched and forwarded to an ElasticSearch cluster. Kibana is then used to search and view log messages.
After testing with a simple ElasticSearch setup, we realised that it can easily get overloaded with even a small number of log messages. Our current cluster includes three different types of pods: data pods responsible for storing ES indexes (deployed as a StatefulSet), API pods to handle queries to and from the ES cluster, and master pods to coordinate the whole process (both deployed as standard Deployments).
As with any cloud application, microservices need to have access to sensitive pieces of information such as database passwords, API keys, etc. These values are never committed in the codebase and dummy values are used for local development.
To ensure that these secrets are stored in a cryptographically secure way, we leverage HashiCorp Vault and its Kubernetes auth backend. Developers can specify the secrets that they need access to as mounted volumes in their deployment. These will automatically be fetched at runtime by an init container using the deployment’s service account to authenticate with HashiCorp Vault. This allows us to have fine-grained policies around what secrets each microservice can access, ensure that tokens are short-lived and secrets can be rotated on demand. Unusual activity can easily be flagged based on detailed audit trails.
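From the microservice's point of view, a secret delivered this way is just a file in a mounted volume. The Python sketch below shows that consuming side only; the mount path `/var/run/secrets/app`, the file name `db-password` and the helper `read_secret` are assumptions for illustration, not our actual layout.

```python
from pathlib import Path

# Hypothetical mount point; the real path is set in the deployment's volume mounts.
DEFAULT_SECRETS_DIR = "/var/run/secrets/app"


def read_secret(name: str, secrets_dir: str = DEFAULT_SECRETS_DIR) -> str:
    """Return the secret stored in the file secrets_dir/name."""
    if "/" in name or name in ("", ".", ".."):
        # Refuse names that could escape the secrets directory.
        raise ValueError(f"invalid secret name: {name!r}")
    # Strip the trailing newline that files often end with.
    return (Path(secrets_dir) / name).read_text(encoding="utf-8").strip()
```

A service would call something like `read_secret("db-password")` at startup. Reading from a file rather than an environment variable keeps the value out of process listings and lets it be refreshed without changing the deployment spec.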
We also use HashiCorp Vault for internal or corp services. Of these, our main use case is issuance of short-lived (~16h) certificates to access our Kubernetes clusters. Engineers will generally log in once a day (using their LDAP credentials) to gain access to the cluster. These certificates map to RBAC roles, allowing us to easily update permissions across different teams or organisations. Engineers can also request higher-privileged certificates if necessary, though these only last 30 to 60 minutes.
As with everything, there are no silver bullets when it comes to deployment management, and Kubernetes is no exception. The end result with Kubernetes was positive — and we have no plans of moving away from it — but it has been a lot of work. We introduced many new technologies which used a very different terminology to what we were used to. It required re-educating not only the infrastructure team but also the product teams, since ownership of certain components would be shifting. The additional layers of abstraction (especially at the network level) make some situations harder to debug than usual.
We are happy with our current Kubernetes setup, but still have a lot of work to do: extending our deployments to other cloud providers like GCP and Azure (as well as on-premise, specifically for our build agents and internal infrastructure), investigating service meshes (such as Istio or Conduit) to improve connections between microservices, and making our deployment process as simple as possible.
Thought Machine is hiring for every role: backend, frontend, infrastructure, security, PMs, sales, etc. If any of this sounds interesting and you’d like to apply, please head over to Workable.