Introducing TileDB Rust bindings

3 minute read Published: 2021-05-27

We have built Rust bindings to the TileDB tensor storage engine. As good Rustaceans do, we're open-sourcing the bindings, and we are quite excited about the possibilities that universal tensor storage opens up.

Tensors -- a fancy name for arbitrary-dimensional arrays -- have recently been in the news thanks to their use in neural network research. Storing tensor data, however, is a long-standing problem in data science. Geographers and astronomers in particular have long contended with storing very large raster files, using formats as diverse as the Hierarchical Data Format (HDF) and GeoTIFF. Tensors are also the building blocks of models expressed in frameworks such as PyTorch and the aptly named TensorFlow, and the core data structure of earlier, more general-purpose tools such as NumPy and GNU Octave.
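To make "arbitrary-dimensional array" concrete, here is a minimal sketch of a dense tensor in Rust: a flat buffer plus a shape, indexed in row-major order. The `DenseTensor` type and its methods are hypothetical names invented for this illustration; they are not part of TileDB or of our bindings.

```rust
/// A minimal dense tensor: a shape plus a flat, row-major buffer.
/// `DenseTensor` is a hypothetical type for illustration only.
#[derive(Debug)]
pub struct DenseTensor {
    shape: Vec<usize>,
    data: Vec<f32>,
}

impl DenseTensor {
    /// Create a zero-filled tensor with the given shape.
    pub fn zeros(shape: &[usize]) -> Self {
        // The buffer holds one element per cell: the product of all dimensions.
        let len: usize = shape.iter().product();
        DenseTensor {
            shape: shape.to_vec(),
            data: vec![0.0; len],
        }
    }

    /// Translate a multi-dimensional index into an offset in the flat buffer.
    /// Returns `None` if the index has the wrong rank or is out of bounds.
    pub fn offset(&self, index: &[usize]) -> Option<usize> {
        if index.len() != self.shape.len() {
            return None;
        }
        let mut offset = 0;
        for (&i, &dim) in index.iter().zip(&self.shape) {
            if i >= dim {
                return None;
            }
            // Row-major layout: the last dimension varies fastest.
            offset = offset * dim + i;
        }
        Some(offset)
    }

    /// Read the value at `index`, if the index is valid.
    pub fn get(&self, index: &[usize]) -> Option<f32> {
        self.offset(index).map(|o| self.data[o])
    }

    /// Write `value` at `index`; returns `false` if the index is invalid.
    pub fn set(&mut self, index: &[usize], value: f32) -> bool {
        match self.offset(index) {
            Some(o) => {
                self.data[o] = value;
                true
            }
            None => false,
        }
    }
}
```

A raster image with several bands fits this shape naturally, for example `DenseTensor::zeros(&[bands, height, width])`. The whole buffer lives in memory here, which is exactly what stops working once a dataset no longer fits on one machine.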

The problem with existing tensor storage is scalability. There are many massive tensor datasets, such as the excellent, free satellite data provided by the Sentinel-2 missions. We recently started working on a project using this dataset, and found ourselves in need of a Big Data solution to access the data. TileDB is an extremely exciting answer to this very problem. It promises to do for tensor data what Hive (the Hive metastore and the ORC format, to be more specific) did for tables -- provide a universal abstraction for storing arbitrarily large amounts of data in a common format. Like the Hive metastore, TileDB integrates with a wide range of query engines, including Trino and Spark, and it also provides APIs for R and Python.

Efficiency is one of the problems encountered in ingesting very large amounts of raster data. This is an area where languages like Python and R, with their potential for memory leaks, are likely to struggle. Like many high-performance libraries, TileDB itself is written in C++, and the library exposes a C API as well. Writing code in C/C++ is itself a daunting task for most professional programmers, not to mention all the other people who code every so often (as in, scientists!). This is the situation where the programming language Rust offers a very exciting way out -- safe code that is also fast. We have written about Rust's advantages for science previously, and Nature seems to agree!

There was one thing missing for us to be able to start loading data into TileDB using Rust -- there were no Rust bindings to speak of. This was not a huge problem, however -- we simply generated TileDB Rust bindings with the bindgen crate. We are open-sourcing these bindings and publishing them for other Rust programmers to use and, hopefully, to encourage other scientists who use tensors and need speed to give Rust a try!
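For readers who have not worked with C bindings from Rust before, the sketch below shows the general pattern. bindgen reads a C header and emits raw `extern "C"` declarations; a thin safe wrapper is then usually written on top. To keep the example runnable without TileDB installed, it uses `strlen` from the C standard library as a stand-in and the declaration is written by hand. The wrapper name `c_string_length` is a hypothetical one chosen for this illustration, and none of this is taken from the actual TileDB bindings.

```rust
use std::ffi::CStr;
use std::os::raw::c_char;

// Hand-written stand-in for the kind of declaration bindgen generates
// from a C header. `strlen` lives in the C standard library, which Rust
// programs already link against on common platforms.
// (In the Rust 2024 edition this block is written `unsafe extern "C"`.)
extern "C" {
    fn strlen(s: *const c_char) -> usize;
}

/// Safe wrapper around the raw C function (hypothetical name).
/// Taking `&CStr` moves the pointer-validity requirement into the type
/// system, so callers never need to write `unsafe` themselves.
pub fn c_string_length(s: &CStr) -> usize {
    // SAFETY: `CStr` guarantees a valid, NUL-terminated pointer that
    // stays alive for the duration of this call.
    unsafe { strlen(s.as_ptr()) }
}
```

Calling `c_string_length` with a `CString` built from `"tiledb"` goes through the C library and back without any `unsafe` at the call site. Generated bindings for a library the size of TileDB follow the same idea, only with many more declarations and with handle types that need explicit cleanup.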