Beyond R and Python: Rust for Science

6 minute read Published: 2020-12-10

Rust looks like an increasingly important part of the future of scientific programming. It's as fast as C++, as flexible as Python, and does packaging better than even R. It is also very difficult to learn, and demands a lot of cognitive investment. On balance, the ability to write clean, fast, and safe code is worth it, but be prepared to re-learn programming.

In the past two years I had something akin to a geeky conversion experience. I first saw Rust in action in a programming languages class at Stanford and was immediately mesmerized by the possibilities. At the time, part of an unrelated project I was working on involved interminable fights with the JVM, trying to debug why some parallelized Java code was causing OOMs due to objects that were not garbage collected. Or, worse, trying to understand why some objects had gone away, leaving my code to fail with the dreaded NullPointerException. Will Crichton's excellent examination of programming languages offered a way out.

Rust, the "low-level" language we examined, looked unusual in its enunciation of one simple principle related to memory management -- every value having a single owner. Ownership can be passed around, but is always vested in a single scope. At the time, the realization that you could build a programming language that followed these principles blew my mind. Garbage collection becomes unnecessary in a world where you can simply discard data when its owner falls out of scope. The idea of memory "ownership" -- i.e. a kind of property right that can be alienated or temporarily ceded to another holder -- also resonated with the social-scientific side of my brain.
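To make the ownership rule concrete, here is a minimal sketch. The `Dataset` type and the `rescale` and `consume` functions are hypothetical names made up for illustration; they are not from any library or from the project described in this post.

```rust
// A hypothetical type used only for illustration.
#[derive(Debug)]
struct Dataset {
    values: Vec<f64>,
}

// Takes ownership of the dataset; it is dropped when this function returns.
fn consume(data: Dataset) -> usize {
    data.values.len()
}

// Takes ownership, modifies the data in place, and hands ownership back.
fn rescale(mut data: Dataset, factor: f64) -> Dataset {
    for v in data.values.iter_mut() {
        *v *= factor;
    }
    data
}

fn main() {
    let raw = Dataset { values: vec![1.0, 2.0, 3.0] };
    let scaled = rescale(raw, 2.0); // ownership moves into `rescale`, then back out
    // println!("{:?}", raw);       // would not compile: `raw` was moved
    println!("{:?}", scaled);
    let n = consume(scaled);        // `scaled` is moved and freed inside `consume`
    println!("{} values", n);
}
```

Uncommenting the line that touches `raw` after the move should produce a compile error rather than a runtime surprise, which is the whole point.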

On top of this simple principle of memory ownership, a wonderful edifice of increasingly sophisticated memory management primitives is built up. References (simply borrowing a value for some amount of time), reference-counted pointers (allowing multiple scopes to hold references to the same memory location, with objects only being destroyed when no one holds a reference to them any longer), atomic reference-counted pointers (when reference counting needs to be done between threads), boxes (heap allocations, useful for dynamically-sized types, since by default the size of every value must be known at compile time), mutexes and read-write locks (access explicitly passed among threads) -- all fitting together in a neat crescendo when the starting point is ownership. Other languages, most notably C++, also offer such flexibility, but Rust really impressed me with its ability to reason about memory management at compile time. The compiler's error messages reinforced this strength in my first encounter with the language. Where C++ would have let you run some wonky code that would eventually result in a segfault, Rust obstinately complained. You could fix the code, or if you were sure you really knew what you were doing, wrap it in an unsafe { } scope, essentially telling the world you did, indeed, really know what you were doing.
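A few of these primitives can be sketched with nothing but the standard library. The helper functions `mean` and `parallel_sum` are hypothetical names chosen for this example; `Rc`, `Arc`, `Mutex` and `thread::spawn` are the standard library types and functions.

```rust
use std::rc::Rc;
use std::sync::{Arc, Mutex};
use std::thread;

// Borrowing: the function reads the slice without taking ownership.
// Assumes a non-empty slice.
fn mean(xs: &[f64]) -> f64 {
    xs.iter().sum::<f64>() / xs.len() as f64
}

// Shared ownership across threads: each thread gets its own Arc handle,
// and the Mutex guards mutation of the shared total.
fn parallel_sum(chunks: Vec<Vec<u64>>) -> u64 {
    let total = Arc::new(Mutex::new(0u64));
    let handles: Vec<_> = chunks
        .into_iter()
        .map(|chunk| {
            let total = Arc::clone(&total);
            thread::spawn(move || {
                let partial: u64 = chunk.iter().sum();
                *total.lock().unwrap() += partial;
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    let result = *total.lock().unwrap();
    result
}

fn main() {
    let data = vec![1.0, 2.0, 3.0];
    println!("mean = {}", mean(&data)); // `data` is only borrowed here

    // Single-threaded shared ownership with a reference count.
    let shared = Rc::new(data);
    let other = Rc::clone(&shared);
    println!("owners = {}", Rc::strong_count(&shared));
    drop(other); // the count goes down; the data is freed when it reaches zero
    println!("owners = {}", Rc::strong_count(&shared));

    println!("sum = {}", parallel_sum(vec![vec![1, 2], vec![3, 4]]));
}
```

Swapping `Arc` for `Rc` inside `parallel_sum` should be rejected by the compiler, because `Rc` is not safe to send between threads.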

This sort of paranoid compiler has its obvious advantages, but it also creates a huge barrier to entry. In some languages, you can write 100 lines of code, and it will run after some minimal syntactic debugging. Progress feels tangible in Python, where I can sometimes code a solution to a problem in a few hours of Jupyter hacking. By comparison, Rust always sends you back to the drawing board, demanding to know, for instance, why you thought it would be a good idea to return a reference to an object which should have been freed (because no one owns it anymore). Thus the sad truth: when I first took on using Rust for a work project, I was at least 10 times slower than when using Python for a comparable task. Yet slowing down and doing things correctly was nonetheless worth it. I even find that the experience has improved how I approach project structure in other languages.
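The "reference to a freed object" complaint looks roughly like the sketch below. The function names `make_label` and `longest` are hypothetical; the commented-out version is the kind of code the compiler refuses, followed by two common ways out.

```rust
// This version is rejected at compile time: `label` is freed when the
// function returns, so the reference would point at nothing.
//
// fn make_label(id: u32) -> &str {
//     let label = format!("sample-{}", id);
//     &label
// }

// Fix 1: return the owned value, so the caller becomes the owner.
fn make_label(id: u32) -> String {
    format!("sample-{}", id)
}

// Fix 2: borrow from something the caller already owns; the lifetime 'a
// ties the returned reference to the input slice.
fn longest<'a>(labels: &'a [String]) -> Option<&'a str> {
    labels.iter().max_by_key(|s| s.len()).map(|s| s.as_str())
}

fn main() {
    let labels: Vec<String> = (1..=10).map(make_label).collect();
    println!("{:?}", longest(&labels));
}
```

Either the caller owns the result, or the result is explicitly tied to data the caller already owns. There is no third option where the reference quietly outlives its data.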

What justified switching to Rust was sheer exasperation with the unmaintainability of a Python code base. This was the case with the project that finally gave me an excuse to turn to Rust. The project required the implementation of a complex algorithm, with many memory accesses and the potential for redundant copying of data. I undertook the rewrite reluctantly, knowing full well how difficult getting the code to work would be. I was not mistaken -- I emerged after 48 hours of fighting with the compiler with a working solution, which would have probably taken 2-4 hours of work in Python. But this solution was incredibly fast and didn't have the memory leak that led me away from Python initially.

Seeing the code work sealed the deal. I had just implemented, from first principles, a complex algorithm. Programming was fun again. It was fun because the exercise of fighting with the Rust compiler was ultimately an exercise in thinking about the design of the code itself. Anyone who has spent a while coding comes to realize intuitively that the design of the code evolves along with the code. In that respect programming is a bit like writing, in that you don't quite know where your keystrokes will take you when you start. The Rust compiler is akin to a great editor who entertains your indulgences and explains to you why they're bad.

There is real potential for science in this language, not as a replacement for existing tools, but as a way to empower people to do things that were previously too difficult. Asynchronous code, actor-based computing, easy networking, all offer the building blocks for large scientific computing applications that need to be both fast and safe, their state distributed across multiple machines. The usage of traits, rather than classes, makes for a sophisticated -- albeit initially counter-intuitive -- approach to code re-use.
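As a rough illustration of code re-use through traits rather than classes, consider the sketch below. The `Estimator` trait and the `Mean` and `Median` types are hypothetical, invented for this example. The trait declares one required method and supplies one default method that every implementor gets for free.

```rust
// Hypothetical trait and types, for illustration only.
// Both implementations assume a non-empty slice.
trait Estimator {
    // Required: each implementor supplies its own point estimate.
    fn estimate(&self, xs: &[f64]) -> f64;

    // Provided: a default method shared by every implementor.
    fn residuals(&self, xs: &[f64]) -> Vec<f64> {
        let center = self.estimate(xs);
        xs.iter().map(|x| x - center).collect()
    }
}

struct Mean;
struct Median;

impl Estimator for Mean {
    fn estimate(&self, xs: &[f64]) -> f64 {
        xs.iter().sum::<f64>() / xs.len() as f64
    }
}

impl Estimator for Median {
    fn estimate(&self, xs: &[f64]) -> f64 {
        let mut sorted = xs.to_vec();
        sorted.sort_by(|a, b| a.partial_cmp(b).unwrap()); // assumes no NaN
        let mid = sorted.len() / 2;
        if sorted.len() % 2 == 0 {
            (sorted[mid - 1] + sorted[mid]) / 2.0
        } else {
            sorted[mid]
        }
    }
}

// Generic code is written against the trait, not a class hierarchy.
fn summarize<E: Estimator>(est: &E, xs: &[f64]) -> f64 {
    est.estimate(xs)
}

fn main() {
    let xs = [1.0, 2.0, 10.0];
    println!("mean = {}", summarize(&Mean, &xs));
    println!("median = {}", summarize(&Median, &xs));
    println!("residuals = {:?}", Median.residuals(&xs));
}
```

There is no base class here: `Mean` and `Median` share behavior only because both implement the same trait, and `summarize` accepts anything that does.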

Beyond bells and whistles, working with a programming language is also about a lot of "little things", and that is also where Rust excels. The Cargo package and build manager offers a neat replacement for pip, easy_install and conda, rolled into one. Rustfmt makes complying with style guides a breeze. Unit tests are incredibly easy to write, requiring no additional packages to run.
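For a sense of how little ceremony unit tests need, here is a minimal sketch. The `variance` function and the test names are hypothetical; `#[cfg(test)]`, `#[test]` and `assert_eq!` are built into the language and its standard tooling, and the tests run with `cargo test`.

```rust
// A hypothetical function under test: unbiased sample variance.
fn variance(xs: &[f64]) -> Option<f64> {
    if xs.len() < 2 {
        return None; // sample variance is undefined for fewer than two points
    }
    let n = xs.len() as f64;
    let mean = xs.iter().sum::<f64>() / n;
    let ss: f64 = xs.iter().map(|x| (x - mean).powi(2)).sum();
    Some(ss / (n - 1.0))
}

fn main() {
    println!("{:?}", variance(&[1.0, 2.0, 3.0, 4.0]));
}

// Compiled only under `cargo test`; no external test framework is needed.
#[cfg(test)]
mod variance_tests {
    use super::*;

    #[test]
    fn variance_of_known_sample() {
        assert_eq!(variance(&[1.0, 2.0, 3.0, 4.0]), Some(5.0 / 3.0));
    }

    #[test]
    fn variance_needs_two_points() {
        assert_eq!(variance(&[1.0]), None);
    }
}
```

The tests live in the same file as the code they exercise, and they are left out of ordinary builds.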

Rust will not replace either R or Python any time soon for scientific applications though. A lot of the high-level building blocks available with a few keystrokes in these languages are simply not available in Rust. But an increasing number of people are working to fill in the gaps, particularly when it comes to machine learning. Indeed, it may be that Rust code becomes the metaphorical connective tissue binding together neural networks, via bindings such as tch. It's a language that has its place in the expanding universe of what we now call data science. Simulations, statistical inference -- anything that involves dealing with data at a large scale -- stand to benefit from this language. The cognitive costs are ultimately worth it.