Trust but Verify. Implementing your data access infrastructure

5 minute read Published: 2021-10-07

Welcome to the third part of our article series on data access. In the first two parts we focused on why data access initiatives sometimes fail and what you can do about it, as well as how to set up your system to handle employee access to sensitive data.

In this part we will look at what a data access system might look like and talk about some software options that can make your life easier and increase the chances that your data science project will be a success.


Data discoverability

First things first. Your data needs to be discoverable by those who have to work with it. Centralized data storage is the most desirable option, but it’s often unachievable. There are many reasons for this: costs can be high, the software you use to gather and transfer data may lack interoperability with the storage solution, data transfer speeds might be slow, and so on.

Centralization also comes with a whole set of implementation challenges, especially in large organizations. It requires a big financial investment and changes to your internal infrastructure so that all the systems work and communicate together, so you will probably need a separate business case for the higher-ups. If you’re already working on changing the way your company views data access, that might be too much to take on at first. You should therefore start with something low-cost or open-source.

At a minimum you need a data catalog: a place where you can register all the data assets you have and make them discoverable across your entire organization. It would be even better if it also gave you some options for limiting access to sensitive data sets such as HR, financial or client information.

A great open-source solution you can use to catalogue and curate your data is Amundsen. Initially developed at Lyft (the ride-sharing app), Amundsen is a data discovery and metadata engine. It indexes and ranks the data resources you have, such as tables or dashboards, so that users can search them more easily. It is described as “Google search for data” and it’s a great option if you are just getting started.
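
To make the idea of a catalog concrete, here is a minimal in-memory sketch in Python. It is not Amundsen's API. The `DataAsset` and `DataCatalog` names, the `usage_count` ranking signal and the sample assets are all assumptions made for illustration. It only shows the two properties discussed above: assets are searchable, and sensitive ones are hidden unless the caller is allowed to see them.

```python
from dataclasses import dataclass, field


@dataclass
class DataAsset:
    name: str
    description: str
    owner: str
    sensitive: bool = False   # e.g. HR, financial or client data
    tags: set[str] = field(default_factory=set)
    usage_count: int = 0      # hypothetical popularity signal used for ranking


class DataCatalog:
    """Toy catalog: register assets, then search them by keyword."""

    def __init__(self) -> None:
        self._assets: dict[str, DataAsset] = {}

    def register(self, asset: DataAsset) -> None:
        self._assets[asset.name] = asset

    def search(self, term: str, can_see_sensitive: bool = False) -> list[DataAsset]:
        term = term.lower()
        hits = [
            a for a in self._assets.values()
            # Hide sensitive assets unless the caller is cleared to see them.
            if (can_see_sensitive or not a.sensitive)
            and (term in a.name.lower()
                 or term in a.description.lower()
                 or term in {t.lower() for t in a.tags})
        ]
        # Most-used assets first, as a crude stand-in for relevance ranking.
        return sorted(hits, key=lambda a: a.usage_count, reverse=True)


# Hypothetical assets, for illustration only.
catalog = DataCatalog()
catalog.register(DataAsset("rides_raw", "Raw ride events", "platform",
                           tags={"rides"}, usage_count=40))
catalog.register(DataAsset("rides_daily", "Daily ride counts per city", "analytics",
                           tags={"rides"}, usage_count=120))
catalog.register(DataAsset("payroll", "Employee salaries", "hr",
                           sensitive=True, tags={"hr"}, usage_count=5))

public_hits = [a.name for a in catalog.search("rides")]
hidden_hits = [a.name for a in catalog.search("salaries")]
hr_hits = [a.name for a in catalog.search("salaries", can_see_sensitive=True)]
```

A real catalog would persist this metadata and pull it from your databases automatically. The point of the sketch is only that discoverability and a basic sensitivity flag can live in the same place.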


Data access enforcement

After making sure that your data can be discovered, the next step is to secure it, meaning that you are able to manage, monitor and enable access to it.

There are existing open-source solutions you can use, such as Apache Ranger or Apache Knox. Another promising one in the field of data access enforcement is Open Policy Agent (OPA) together with its policy language, Rego. OPA is an open-source policy engine which allows you to define your data access rules as code. Below is an example that uses OPA + Rego to create some rules.

[Figure: Rego data access rules]

OPA is a tool you can use to answer very complex yes/no questions. For example, “Is the user making this request allowed to access the tables in this query?” To answer this, it takes the user permissions and the query as input “facts” and generates a yes or no answer as output.

[Figure: OPA output]
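
The yes/no decision described above can be sketched in plain Python. This is not Rego and does not call OPA. It is a stand-in that mirrors the same shape: facts go in (who is asking, which tables the query touches, what has been granted) and a single boolean comes out. The `GRANTS` table, the `allow` function and the request format are assumptions made for this illustration.

```python
# Hypothetical grants: which tables each user may read.
GRANTS = {
    "alice": {"sales.orders", "sales.customers"},
    "bob": {"sales.orders"},
}


def allow(request: dict, grants: dict = GRANTS) -> bool:
    """Return True only if the user may read every table the query touches."""
    granted = grants.get(request.get("user"), set())
    tables = set(request.get("tables", []))
    # Default deny: unknown users and requests that name no tables are rejected.
    return bool(tables) and tables <= granted


# Hypothetical requests, for illustration only.
alice_ok = allow({"user": "alice", "tables": ["sales.orders", "sales.customers"]})
bob_denied = allow({"user": "bob", "tables": ["sales.orders", "sales.customers"]})
stranger_denied = allow({"user": "mallory", "tables": ["sales.orders"]})
```

Keeping the decision in one pure function is the main design point. The rules can be versioned, reviewed and tested like any other code, which is what a policy engine such as OPA gives you at a larger scale.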

Assessing your strategy

As with any other business initiative, you need to be able to track how your data access system is used and how efficiently it is used, so you can make changes if necessary. When it comes to data access you should look at:

Another way to go is to use A/B testing and see what impact auto-grant vs. human review has on productivity.
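
One way to read the result of such an A/B test is sketched below, assuming you record a turnaround time (for example, hours between an access request and the moment the data can be used) for each request in both groups. The metric, the `compare_groups` helper and the sample numbers are assumptions, not something the article prescribes. The sketch uses a simple permutation test from the standard library to check whether the observed difference in means could plausibly be chance.

```python
import random
from statistics import mean


def compare_groups(auto_grant, human_review, n_permutations=10_000, seed=0):
    """Return (difference in mean turnaround, two-sided permutation p-value)."""
    observed = mean(human_review) - mean(auto_grant)
    pooled = list(auto_grant) + list(human_review)
    n_auto = len(auto_grant)
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    extreme = 0
    for _ in range(n_permutations):
        # Shuffle the group labels and recompute the difference.
        rng.shuffle(pooled)
        diff = mean(pooled[n_auto:]) - mean(pooled[:n_auto])
        if abs(diff) >= abs(observed):
            extreme += 1
    return observed, extreme / n_permutations


# Made-up turnaround times in hours, for illustration only.
auto_hours = [0.5, 1.0, 0.7, 1.2, 0.9]
review_hours = [6.0, 9.5, 4.0, 12.0, 7.5]
difference, p_value = compare_groups(auto_hours, review_hours, n_permutations=2_000)
```

A small p-value suggests the gap between the two groups is unlikely to be noise. Whether the gap matters for productivity is still a judgment call that depends on your own baseline.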


Conclusions

This brings us to the end of our article series on data access and how you should go about implementing it in your company. It’s been quite a journey. So, let’s go over some of the key ideas you should keep in mind as you prepare your own strategy.

  1. Before you start any AI or ML initiative, make sure you’ve ironed out how people can access your datasets
  2. Define what you see as sensitive data and establish a clear set of rules for accessing it (bear in mind that there are plenty of opportunities to automate this)
  3. Non-sensitive internal data should be open and discoverable across the organization
  4. Document and audit everything
  5. A good building block to start with in terms of infrastructure is: OPA + git + Kubernetes