Walden is our reference implementation of a data warehouse. After adding instructions for its deployment on Amazon's EKS last month, we are now also supporting it on Microsoft's Azure Kubernetes Service (AKS).
In the world of cloud infrastructure, Amazon's AWS and Microsoft's Azure are the two leading platforms. A recent review by Holori puts AWS in the lead with 37% of the U.S. cloud market, with Azure in second place at 23%. It is therefore not surprising that both cloud offerings feature managed Kubernetes products.
Azure Kubernetes Service (AKS) is the default answer for anyone looking to run Kubernetes on Azure. This is a common enough use case that we decided to write instructions for a minimal AKS cluster that can run Walden. Compared to EKS, creating this minimal cluster is easier on AKS -- there are fewer steps to run, and fewer permissions issues to worry about.
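To give a feel for how few steps are involved, here is a sketch of creating such a minimal cluster with the Azure CLI. This assumes an already-authenticated `az` session; the resource group name, cluster name, region, and VM size are illustrative placeholders, not values from our instructions.

```shell
# Placeholder names: "walden-rg" and "walden-aks" are illustrative only.
# Create a resource group in the target region.
az group create --name walden-rg --location eastus

# Create a minimal AKS cluster. Five nodes of 2 vCPUs each stays
# within the default 10-vCPU quota on a new Azure account.
az aks create \
  --resource-group walden-rg \
  --name walden-aks \
  --node-count 5 \
  --node-vm-size Standard_B2s \
  --generate-ssh-keys

# Fetch credentials so kubectl can talk to the new cluster.
az aks get-credentials --resource-group walden-rg --name walden-aks
```

From there, `kubectl get nodes` should show the five nodes, and deployment proceeds as on any other Kubernetes cluster.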
We did, however, find two things daunting on AKS:
- machine selection. There are hundreds of different machine types available. Selecting the wrong combination of machine type and zone will result in cryptic error messages, especially when done from the Azure CLI. We found that using the AKS GUI greatly simplified finding a good machine type for the intended zone.
- resource quotas. New Azure accounts start life with very strict limits on the number of resources one can provision. Most challenging for us was the fact that a new Azure account can only provision 10 CPUs in total. This is problematic given the number of services we need to run for Walden. Thankfully, we did manage to fit everything into 5 VMs with 2 CPUs each. Scaling the cluster beyond this minimal configuration would likely entail a time-consuming quota increase request, however.
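Both pain points can be probed from the CLI before committing to a cluster. The following sketch, again assuming an authenticated `az` session and using `eastus` purely as an example region, shows how to check which VM sizes are actually available in a region and how much of your vCPU quota is in use:

```shell
# List VM SKUs offered in the region; the Restrictions column flags
# sizes that are unavailable in particular zones, which is the usual
# cause of the cryptic provisioning errors mentioned above.
az vm list-skus --location eastus --size Standard_D2 --output table

# Show current usage against subscription quotas in the region,
# including the total regional vCPU limit (10 on a new account).
az vm list-usage --location eastus --output table
```

Running these two checks up front would have saved us most of the trial and error with machine types and quotas.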
These two aspects aside, we found AKS to be straightforward to interact with. We suspect that many people wanting to try Walden on AKS will already have access to an enterprise Azure account, and will already have ample resource allocations, as well as enough knowledge of their specific compute region to be able to select the right VMs for their needs. This should ensure an even smoother experience with this self-contained data warehouse.
More details can be found on GitHub.