Introducing Walden

5 minute read Published: 2021-02-15

We have built Walden, a small data lake for (mostly) solitary use, consisting of a set of configurations and images for deployment into a Kubernetes cluster. We are releasing the code as free and open source software, hoping to lower some of the barriers to entry into the world of big data and AI. Check it out on our GitHub, or read below for more info!

Simply interacting with your data can be a daunting challenge for many scientists. When starting a new project, you may have gigabytes and gigabytes of... stuff you could be working with. Your data may be so massive that you cannot process it efficiently in, say, a single R process. Or maybe you need to run data processing multiple times, trying different things. This often happens when training ML models, for instance, but it is increasingly a generic problem across the sciences.

Researchers are creative when it comes to managing large datasets. From networked file storage to Dropbox, Google Drive, and Amazon S3, there are many ways to store the data, but they are all fundamentally messy -- and this messiness often becomes apparent six months into a project, when the initial good intentions have been replaced by the accretions of a thousand experiments.

There is, of course, a better way. Modern data engineering has coalesced around the "data lake" as a general-purpose solution for storing and interacting with large amounts of data. When it comes to tabular data (e.g. CSVs), data lakes backed by a SQL data processing engine have practically become the norm. Mature data lake solutions are available on a pay-as-you-go basis from the big cloud providers, Google BigQuery being one example we particularly like. These solutions, though technically excellent, are expensive and require uploading your data to the cloud. Many research teams simply cannot take that route, whether for financial or legal reasons.

So, what is one to do when large amounts of data must be processed on-site, and at low cost? The answer requires stringing together a number of modular open-source tools. Doing so is not for the faint-hearted or the impatient, however -- the tools are in various states of development, and very confusing bugs are to be expected in any version of the code. Most importantly, without a background in data engineering it is difficult to even understand what some of these tools do and how they fit together. Yet they are very useful tools that deserve to be easier to use.

We found three such tools that we think are excellent -- MinIO, the Hive Metastore, and Trino -- and got them to work together in a Kubernetes environment. Kubernetes (or K8s) may sound daunting, but it is really just a way to easily distribute computation across multiple machines. If you have never seen it before and are comfortable with the Linux command line, you can try it out via K3s, a lightweight version of the system. We have also found it capable of growing to fit modest requirements: we ourselves use K3s for a twelve-node cluster in our office, used for development and testing.
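If you want to sanity-check that a freshly installed cluster is reachable, the official Kubernetes Python client makes that a few lines of code. This is a minimal sketch, not part of Walden itself; it assumes you have run `pip install kubernetes` and have a kubeconfig in the default location (for K3s, you may need to point at `/etc/rancher/k3s/k3s.yaml`):

```python
# Minimal sketch: list the nodes of a cluster using the official
# Kubernetes Python client. Assumes a kubeconfig in the default location.
from kubernetes import client, config

config.load_kube_config()  # for K3s, set KUBECONFIG=/etc/rancher/k3s/k3s.yaml

v1 = client.CoreV1Api()
for node in v1.list_node().items:
    print(node.metadata.name, node.status.node_info.kubelet_version)
```

If this prints one line per machine, the cluster is up and you are ready to deploy workloads onto it.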

The tools mentioned here create a kind of "Google BigQuery in a box" on local machines. MinIO provides object storage, the Hive Metastore keeps track of table schemas, and Trino (formerly known as Presto) runs SQL against these tables. Trino even supports basic machine learning internally, and it is often excellent in the exploratory stages of a project.
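To make the division of labor concrete, here is a hedged sketch of the round trip: drop a CSV into MinIO with the `minio` Python client, register it as an external table through Trino's Hive connector, and query it with the `trino` client. The hostnames, credentials, bucket, and catalog/schema names below are illustrative placeholders, not Walden's actual defaults -- check the repo's docs for those.

```python
# Sketch of the "BigQuery in a box" round trip. Hostnames, credentials,
# and catalog/schema names are illustrative placeholders.
from minio import Minio
import trino

# 1. Object storage: upload a local CSV file to a MinIO bucket.
store = Minio("minio.example.local:9000",
              access_key="admin", secret_key="password", secure=False)
if not store.bucket_exists("demo"):
    store.make_bucket("demo")
store.fput_object("demo", "events/events.csv", "events.csv")

# 2. SQL engine: expose the bucket prefix as an external Hive table,
#    then run SQL against it. (Trino's Hive connector requires varchar
#    columns when the underlying files are CSV.)
conn = trino.dbapi.connect(host="trino.example.local", port=8080,
                           user="walden", catalog="hive", schema="default")
cur = conn.cursor()
cur.execute("""
    CREATE TABLE IF NOT EXISTS events (user_id varchar, action varchar)
    WITH (external_location = 's3a://demo/events/', format = 'CSV')
""")
cur.execute("SELECT action, count(*) FROM events GROUP BY action")
print(cur.fetchall())
```

The pattern is the same as in the big cloud offerings: the storage layer only sees opaque objects, the metastore remembers that those objects form a table, and the query engine does the heavy lifting.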

Walden is meant as a "middling" data tool, for datasets with tens of millions, maybe hundreds of millions, of rows. It offers a quick way to get started analyzing data by providing a reasonable environment where the services can talk to each other. We got to a working configuration after months of iteration and occasional bug-fighting, and the resulting frustration made us want to put something useful out into the world.

There are obvious limitations to Walden. It does not include all the typical components of a modern data management stack -- most notably, there is no ETL solution (though we may add Airflow in the future!), and nothing for ML training beyond the limited toolkit provided by Trino. Many iterations would be necessary to turn this into a production-ready solution. But we believe it is a good place to start when you just want to work on your data.

If you would like to try it, the code and installation instructions are available at github.com/scie-nz/walden. We plan to continue iterating on the stack as we use it ourselves.

(We are particularly grateful for the excellent documentation published by Vitaly Saversky on his blog.)