Data anonymization techniques less reliable in era of big data

Data anonymization techniques are designed to preserve the privacy of shared data, but do they work with high-dimensional data? Here's what experts have to say.

Privacy is an endangered species in this data-abundant world. While it is critical that organizations leverage data to gain insights, predict trends and spark innovation, sensitive personal information is often put at risk.

To enable use cases like data analytics and data sharing in a privacy-conscientious way, organizations turn to data anonymization techniques.

"Data anonymization is a two-step process -- pseudonymization and de-identification," said Yves-Alexandre de Montjoye, research scientist at the MIT Media Lab, who will join Imperial College as a lecturer. "The idea, if it were to work, is to take sensitive data like mobile phone and medical data and remove any information that can link it back to an individual. We can then use it in research for example without endangering people's privacy."

Most data anonymization techniques rely on two types of identifiers: direct and indirect, de Montjoye said. The process of removing direct identifiers like names, street addresses, phone numbers -- any piece of information that directly allows tracing data back to a certain person -- is called pseudonymization.

Pseudonymization of data is achieved by either removing the identifiers or by replacing them with a random ID or hashing them with a salt, de Montjoye said.
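As an illustration of the two replacement approaches de Montjoye describes, here is a minimal Python sketch. The function names, the record layout and the sample email addresses are hypothetical and do not come from his work; the salted hash simply prepends a secret salt to the identifier before applying SHA-256.

```python
import hashlib
import secrets


def pseudonymize_random(records, id_field):
    """Return copies of the records with the direct identifier replaced by a random ID.

    The same original value always maps to the same random ID within one call,
    so records belonging to one person stay linked to each other.
    """
    mapping = {}
    result = []
    for record in records:
        original = record[id_field]
        if original not in mapping:
            mapping[original] = secrets.token_hex(8)
        pseudonymized = dict(record)  # copy, so the input is left untouched
        pseudonymized[id_field] = mapping[original]
        result.append(pseudonymized)
    return result


def pseudonymize_salted_hash(value, salt):
    """Replace an identifier with the SHA-256 hash of (salt + identifier).

    The salt must be kept secret; otherwise anyone can hash candidate
    identifiers and compare the results.
    """
    return hashlib.sha256(salt + value.encode("utf-8")).hexdigest()


# Hypothetical sample data for illustration only.
transactions = [
    {"customer": "alice@example.com", "shop": "bakery", "amount": 4.50},
    {"customer": "bob@example.com", "shop": "cafe", "amount": 3.20},
    {"customer": "alice@example.com", "shop": "gym", "amount": 25.00},
]

random_ids = pseudonymize_random(transactions, "customer")
salt = secrets.token_bytes(16)
hashed_ids = [
    {**t, "customer": pseudonymize_salted_hash(t["customer"], salt)}
    for t in transactions
]
```

In both cases only the direct identifier changes. The shops and amounts are left exactly as they were, so every indirect identifier remains in the data.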

The main benefit of data anonymization is that it allows organizations to get more use out of their data, said Gartner analyst Ramon Krikken.

"Data anonymization techniques allow organizations to modify the data in such a way that the privacy of individuals within the data set remains protected at least in some way," Krikken said.

But in the era of big data, data anonymization techniques fail to deliver because there are hundreds of thousands of data points for a single individual, de Montjoye said.

As evidence, de Montjoye pointed to research he conducted with colleagues at MIT, which showed that just four pieces of information are enough to identify 90% of the people in a data set containing the credit card transactions of more than a million users.
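The idea behind that kind of measurement can be sketched in a few lines of Python. This is a simplified illustration, not the researchers' actual method or code: the `unicity` function, the `traces` layout and the toy data are assumptions made for the example. For each person, it picks a few known points at random from their pseudonymized trace and checks whether anyone else's trace also contains all of them.

```python
import random


def unicity(traces, p, seed=0):
    """Estimate the share of people uniquely identified by p known points.

    traces maps a pseudonymous user ID to a set of points, for example
    (shop, day) pairs. People with fewer than p points are skipped.
    """
    rng = random.Random(seed)
    eligible = 0
    unique = 0
    for points in traces.values():
        if len(points) < p:
            continue
        eligible += 1
        # Points an outside observer is assumed to know about this person.
        known = set(rng.sample(sorted(points), p))
        # Count every trace that contains all of the known points.
        matches = sum(1 for other in traces.values() if known <= other)
        if matches == 1:  # only the person's own trace matches
            unique += 1
    return unique / eligible if eligible else 0.0


# Hypothetical toy data: three pseudonymized shoppers.
traces = {
    "u1": {("bakery", "mon"), ("cafe", "tue"), ("gym", "wed")},
    "u2": {("bakery", "mon"), ("cafe", "tue"), ("cinema", "thu")},
    "u3": {("bakery", "mon"), ("bookshop", "fri"), ("gym", "wed")},
}

print(unicity(traces, p=3))  # 1.0 here, since every full trace is distinct
```

No names appear anywhere in the toy data, yet knowing three purchases is enough to single out each person. That is the sense in which pseudonymized, high-dimensional data can remain identifiable.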

"Taking the data and pseudonymizing it, removing the direct identifiers and trying to add noise or do other things to prevent re-identification, basically just doesn't work anymore when the data becomes high dimensional," de Montjoye said.

One hundred percent de-identification cannot be achieved, Krikken agreed, because there is always a possibility that certain items within the data will be identified.

Still, preserving anonymity in large data sets should be of utmost importance for organizations, because re-identification of data could damage their brands and carry legal consequences, Krikken said.

The organization is responsible for data management and anonymization unless the liability is somehow passed on to the anonymization provider, said John Isaza, head of the information governance and records management practice at law firm Rimon P.C., in an email interview.

"Even then, the organization would be the one facing the public relations exposure, which in some respects could be more costly than the cost of a breach," Isaza said. "Under the new General Data Protection Regulations coming out of Europe, the organization could face a sanction of up to 4% of its annual gross global revenue in the event of a breach or data mismanagement."

With data anonymization techniques failing to scale, what's the alternative?

The solution lies in computer science research, de Montjoye said.

He is currently associated with a project called OPAL, which focuses on building secured infrastructure that will allow the use of data while giving people strong guarantees that the data is used in a privacy-conscientious manner.

"The way we do it is by making sure we automatically control how the data is being used, what comes out of the secured environment, and how it is being aggregated, ensuring the data is being used anonymously even if the data itself is not anonymous," de Montjoye said.
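To make the general idea concrete, here is a minimal Python sketch of a query interface that only releases aggregates. It is not OPAL's design or code: the `count_by` function, the `MIN_GROUP_SIZE` threshold and the sample records are hypothetical, chosen only to show raw records staying inside a controlled environment while counts come out.

```python
from collections import Counter

# Hypothetical threshold: groups smaller than this are never released.
MIN_GROUP_SIZE = 3


def count_by(records, key, min_group_size=MIN_GROUP_SIZE):
    """Return per-group counts, suppressing groups that are too small.

    The caller never sees individual records, only counts for groups
    with at least min_group_size members.
    """
    counts = Counter(record[key] for record in records)
    return {group: n for group, n in counts.items() if n >= min_group_size}


# Hypothetical records held inside the controlled environment.
records = [
    {"user": "u1", "region": "north"},
    {"user": "u2", "region": "north"},
    {"user": "u3", "region": "north"},
    {"user": "u4", "region": "south"},
]

print(count_by(records, "region"))  # {'north': 3}
```

A real system would need considerably more than a size threshold, since combinations of aggregate queries can still leak information about individuals.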
