We discuss why data access is a surprisingly difficult problem to solve for large organizations. Neglecting this problem poses underappreciated yet existential risks to a company's ML strategy. We propose a few principles that can help open up data access in a way that both fosters innovation and manages risk effectively.
It's a story as old as data science itself (so, about 11 years old!). Having just gotten back from a high-profile industry conference or after watching a fascinating tech talk, the company's executive team decides that it's time to take AI seriously. Exciting times ahead!
As with any large-scale corporate strategy, budget lines are set up, new departments are created and requests are made to recruit specialists for AI roles -- experienced data scientists or newly-minted quantitative PhD researchers. Interviews are held, hirings are done and the project is almost ready to start.
The newly hired data scientists make a jarring discovery, however. Despite the company's best intentions, not only is there no infrastructure to handle and analyze data, but there is no data to be analyzed whatsoever!
Avoiding the latter situation is the subject of this particular article series. We want to cover some of the main challenges related to data access and discuss the creation of a reasonable data access strategy, so that your AI story has a fighting chance of having a happy ending. More WALL-E and less 2001: A Space Odyssey😊.
Why data access is hard
Data access has some inherent risks associated with it. The main, painfully obvious risk is that data can be used in an improper way, which can lead to huge legal and financial consequences for your company. This is doubly true with personal data, especially if privacy and consent procedures aren't properly set up or communicated correctly. How to solve this problem is best left to legal professionals, however.
Even assuming no legal risk, things can still become dicey from an ethical perspective. Recall for instance the case of a major US retailer from almost a decade ago. In that case researchers had managed to build a surprisingly accurate (at least in some cases) predictor of pregnancy status based on shopping habits. The ethical implications of creating this Machine Learning product did not become clear to company management until it became a public scandal which helped make the case for stricter AI ethics procedures. The prospect of such a debacle is very much on everyone's minds when doing AI in 2021!
Even when it is used for boring internal purposes that are thoroughly uncomplicated, both ethically and legally, data can still inadvertently reveal tightly-held business secrets. New product launches, market strategies or upcoming acquisitions of other companies can be derived from internal company data. Both legal and risk-management considerations often dictate that not everybody in the company can know about such plans. Data access requests also need to be seen from the perspective of business secrecy.
Furthermore, unguarded sensitive data is also a security liability, as the ever-growing list of data breaches and cyber-attacks reveals. There are very good reasons (some of them dictated by law) to keep personal identifiers, health information, financial information and the like under lock and key, so as to minimize the risk of attackers making away with it.
All of these are valid reasons why data access needs to be restricted, even inside an organization. But even when good datasets exist and could be shared internally, there is often a hesitancy against openness. This reticence has many sources -- complicated internal procedures, lack of time, or just plain old internal politics.
Exacerbating all of the forces working against data sharing is a weak or missing data governance strategy. By this fancy term we mean those policies that manage how the data is gathered, stored, processed and disposed of in the organization. Data governance strategies enumerate who should have access to what kind of data, and are a crucial tool in ensuring compliance with external laws, regulations, and standards, from GDPR to ISO-27001.
The world of data governance is new, rapidly evolving, immensely confusing and jargon-laden. It is not surprising, therefore, that many organizations treat data governance more as a box to check on their to-do list than as an essential part of the design of their data science organization.
Data gridlock
It is a surprising observation to many outside of data science that many (most?) "real-world" Machine Learning projects start with no data. The situation is disappointingly rare where you, the data scientist, get called in to solve a problem that looks and feels like the neatly-packaged challenge seen in an ML course, or in a Kaggle competition. It is quite often the case that relevant data is theoretically available somewhere in the organization, but you as the data person have no access to it, as the data "belongs" to other departments such as sales, marketing or accounting. This state of affairs means that you first need to spend time scheduling meetings with department heads, making your case and going through several mind-numbing internal procedures before you get to run any SQL queries.
There is then the expectation of knowing in advance what data is required for the project. When "negotiating" data access, it is not at all uncommon for the "owner" to want to know which columns the requester wants access to. Assuming that this question can be answered at the point at which it is asked is often unrealistic, in science in general and not just in data science. It is maddeningly difficult to predict in advance which tables, columns and rows will be useful for the question at hand without spending some time working with them!
An even scarier scenario (which does come to pass!) is one where the organization simply does not have any data that is relevant to the project scope. In this case the data scientist needs to devote time, energy and resources to setting up a data acquisition process (whether through direct data collection, or by buying an external dataset), and getting it approved and funded.
Simply put, many companies are not prepared for data science initiatives, especially when it comes to their internal culture and processes. The negative impact of this gridlock on R&D initiatives is significant. Because of this way of doing things, scientists will usually stick with analyzing whatever data they can access. Such a "dataset of convenience" might not be up-to-date or even all that relevant to the project.
Data access friction also translates into reduced or absent collaboration between departments. Important conversations will never happen, and key questions will never be asked, if every new idea requires the scientists to go through a long and tedious process to receive access (or not).
As a result of these chilling effects on data access, ML initiatives risk failing to deliver the business value that made them so promising to the company in the first place.
Data Access Best Practices
By now it should be clear that data access can be a big headache. But what can you do to make sure you avoid data gridlock? Our proposal centers around a slightly different philosophy.
Start by assuming the best out of everyone with data access. Presume that people will use the data to do their job as best they can. That they are going to respect the requirements that have been set up to protect the data. And that they will act in the best interest of the organization.
Advocating for trust on this level, in the face of all the risk, may seem foolhardy, but a trusting environment is also one in which innovation can thrive.
Trust should not be blind, however, as Gorbachev and Reagan discovered when negotiating a very high-level nuclear-weapon reduction agreement in 1987. "Trust, but Verify" goes the Russian proverb that became the basis for detente between two strategic adversaries.
This seems like a sensible principle not just for nuclear weapons reduction, but also for the more pedestrian data access story discussed here. Make sure you extensively document the scope, purpose and duration of data requests, as well as the data assets created as a result of these requests. Being systematic about recording goals and scopes of access allows for an overview of who has access to what data, why, and for how long. This is incredibly important for managing the various risks of data access discussed previously.
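To make the "verify" half a little more concrete, here is a minimal sketch of what a record of a data request could look like. Everything in it is an assumption made for illustration: the `AccessGrant` class, its field names and the two helper functions are hypothetical, not a reference to any existing tool. The point is only that once scope, purpose and duration are written down in a structured way, "who has access to what, and for how long" becomes a question you can answer with a query.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class AccessGrant:
    """One approved data access request (hypothetical schema)."""
    requester: str
    dataset: str
    columns: tuple[str, ...]            # scope: what was asked for
    purpose: str                        # why it was asked for
    granted_on: date
    duration_days: int                  # how long the access should last
    derived_assets: tuple[str, ...] = ()  # datasets or models built from it

    @property
    def expires_on(self) -> date:
        return self.granted_on + timedelta(days=self.duration_days)

    def is_active(self, today: date) -> bool:
        return self.granted_on <= today <= self.expires_on


def who_has_access(grants, dataset, today):
    """Sorted names of requesters holding an active grant on `dataset`."""
    return sorted(
        {g.requester for g in grants if g.dataset == dataset and g.is_active(today)}
    )


def expired(grants, today):
    """Grants whose duration has run out and that should be reviewed or revoked."""
    return [g for g in grants if today > g.expires_on]
```

In practice such records would probably live in a database or a ticketing system rather than in Python objects, but the fields worth capturing stay the same.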
A conscientious bias towards openness is another important principle. That is, make sure that when you limit or close access to data, this is the exception rather than the rule. Every project has sensitive data. But in order to avoid bottlenecks or delays, one should clarify what sensitive data means in the context of the project, as well as what steps those involved in the project need to take in order to access it.
Last but not least, build tools to support the process. Try to automate as much of it as possible, from handling data access requests to making the datasets within the company easier to discover.
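As a rough illustration of what such tooling might start from, the sketch below shows a tiny dataset catalog with keyword search, plus a routing rule that approves requests for non-sensitive data automatically and sends the rest to the data owner. All names here (`CatalogEntry`, `search_catalog`, `route_request`, the `sensitive` flag) are hypothetical and chosen for the example; a real setup would need a proper definition of what counts as sensitive.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogEntry:
    """Minimal description of a dataset (hypothetical schema)."""
    name: str
    owner: str
    description: str
    sensitive: bool = False


def search_catalog(entries, keyword):
    """Case-insensitive keyword match on dataset name and description."""
    needle = keyword.lower()
    return [
        e for e in entries
        if needle in e.name.lower() or needle in e.description.lower()
    ]


def route_request(entry):
    """Open by default: only sensitive datasets require a manual review."""
    return f"review:{entry.owner}" if entry.sensitive else "auto-approve"
```

The routing rule encodes the bias towards openness described earlier: restricting access is the explicit exception, and everything else goes through without a meeting.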
This is the first of a series of articles. Next, we will look at some concrete ideas around how you can manage access to sensitive data and mitigate the risks associated with it.