Walden is our reference implementation of a data warehouse. We are now supporting it on Amazon's Elastic Kubernetes Service. Follow deployment instructions here, or read more information about our experience deploying a data warehouse on AWS below.
In recent years, Kubernetes has emerged as the de facto standard for cloud infrastructure orchestration. This has greatly eased the pain of deploying complex, modular software on computers running in data centers near and far. All the major cloud providers (and increasingly, some of the minor ones too!) now offer a hosted Kubernetes solution. Generally speaking, these solutions work great. But even though we have all come to speak the common language of Deployments, Services and StatefulSets, there are still many annoying little things involved in getting software to run in a specific hosted Kubernetes environment.
This variability between hosted Kubernetes offerings is why we decided to put together specific instructions for deploying Walden on the major cloud platforms. We quite like having access to software that "just works," which is why we are particularly thrilled to share our findings with the world. It typically takes us about 30-60 minutes to deploy and test Walden on EKS -- most of that time is spent waiting on the cluster creation step.
We quite like EKS as a product. AWS does an excellent job not getting in your way when getting started with a service. AWS is a mature cloud offering, and it's quite easy to get free support online. Adapting our deployment to EKS mostly involved figuring out some issues related to file system groups.
Using EKS also means you get to use eksctl, which is an amazing little tool. We particularly like how easy it makes deploying spot instances, as opposed to on-demand ones. Spot instances on AWS tend to be reliable Kubernetes workers, although you might run into some annoying issues with volumes when nodes inevitably do restart. On that note, Walden is meant as an educational tool, first and foremost. It's a good starting point for more sophisticated designs, and it's a good state to revert a cluster design to. But we emphasize that this is by no means a production data warehouse!
A word about costs. It costs us under 5 USD to deploy a minimal Walden cluster (control plane and 4 r5.large machines). We expect running it to cost about 0.50 USD per hour. Of course, you, the deployer, are ultimately responsible for costs in your tenancy, so take these estimates with a grain of salt!
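If you want to redo this arithmetic for your own cluster, a minimal sketch might look like the following. The two hourly rates are placeholders we picked only so that the total matches the rough 0.50 USD per hour figure above; they are not quoted AWS prices, and the function names are ours, purely for illustration. Plug in the current rates for your region and instance purchase option before trusting the result.

```python
def hourly_cost(control_plane_rate: float, node_rate: float, node_count: int) -> float:
    """Hourly cost in USD: one control plane plus `node_count` identical workers."""
    return control_plane_rate + node_rate * node_count


def session_cost(hourly: float, hours: float) -> float:
    """Total cost in USD of keeping the cluster up for `hours` hours."""
    return hourly * hours


# ASSUMPTION: placeholder rates in USD per hour, not real AWS pricing.
CONTROL_PLANE_RATE = 0.10
WORKER_RATE = 0.10
WORKER_COUNT = 4  # the minimal Walden cluster uses 4 r5.large machines

per_hour = hourly_cost(CONTROL_PLANE_RATE, WORKER_RATE, WORKER_COUNT)
print(f"Estimated hourly cost: {per_hour:.2f} USD")
print(f"Estimated cost of an 8-hour session: {session_cost(per_hour, 8):.2f} USD")
```

The estimate deliberately leaves out storage, data transfer and load balancer charges, so treat it as a lower bound.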
If you would like to just try Walden out on EKS, our instructions are here. When writing them we tried to assume as little AWS-specific knowledge as possible on the part of the reader, so we also did our best to explain, at a high level, the few relevant AWS-specific concepts that we encountered in this exercise.