Digital governance, compliance complicated by data lake accumulation

Corporate data lakes have become common as businesses capitalize on their analytic benefits, but their prevalence should make companies rethink digital governance and compliance.

The proliferation of big data analytics in business has given rise to "data lakes," default repositories for all information assets that might be subjected to analysis. That has been a boon for companies looking to gain additional value from their data. But many companies allow data sources to be dumped into these lakes with little control, which unwittingly introduces regulatory compliance risks.

Data cataloging tools can help mitigate these risks, but before we discuss those tools, let's talk about use cases in which data lakes can introduce compliance problems and why the lack of digital governance strategies for data lakes is so risky.

One use case is described in Section 153(f) of the Dodd-Frank Wall Street Reform and Consumer Protection Act, which allows the director of the Office of Financial Research (OFR) to subpoena a financial institution to provide data required by the office to oversee potential risks to U.S. financial stability. This subpoena power is not restricted to a specific type of structured database at a predetermined time. Instead, any data artifact could be requested at any time. This effectively means that financial institutions must be aware of all the data artifacts they manage, as well as the information each artifact contains and how that information relates to other data sets.

Another use case is the numerous data protection regulations that impose penalties for information exposure. The HIPAA Privacy Rule protects most "individually identifiable health information," also referred to as patients' protected health information (PHI), which includes identifiers such as names, telephone numbers, addresses and Social Security numbers. The companion HIPAA Security Rule states that "a covered entity or business associate must (…) implement technical policies and procedures for electronic information systems that maintain electronic protected health information to allow access only to those persons or software programs that have been granted access rights."

This requires monitoring and mitigating any potential PHI exposure risk, whether it stems from an individual hacker or a software application. Therefore, as data artifacts accumulate in a business data lake or other type of high-volume repository, companies must not only determine which data artifacts contain PHI, but also be aware of how the combination of different data sets might inadvertently expose protected health data.
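To make the combination risk concrete, here is a minimal Python sketch. It assumes each data set is described only by its attribute names, and it uses two hypothetical, hand-picked classification lists (`IDENTIFIERS` and `HEALTH`). It is a deliberately simplified illustration, not a legal test of what counts as PHI under HIPAA.

```python
# Hypothetical attribute classifications, chosen for illustration only.
# A real deployment would derive these from profiling and legal review.
IDENTIFIERS = {"name", "ssn", "phone", "address"}
HEALTH = {"diagnosis", "medication", "lab_result"}


def contains_phi(attributes):
    """Simplified rule: PHI is present when an identifier sits next to health data."""
    attrs = set(attributes)
    return bool(attrs & IDENTIFIERS) and bool(attrs & HEALTH)


def combination_exposes_phi(first, second):
    """True when neither data set holds PHI alone, but joining them
    on a shared attribute would place identifiers next to health data."""
    first, second = set(first), set(second)
    if contains_phi(first) or contains_phi(second):
        return False  # the risk already exists without any combination
    shared = first & second  # candidate join keys
    return bool(shared) and contains_phi(first | second)


# Two data sets that look harmless on their own.
claims = ["member_id", "diagnosis", "medication"]
members = ["member_id", "name", "address"]

risky = combination_exposes_phi(claims, members)
```

In this sketch `risky` comes out `True`: the shared `member_id` column lets the two data sets be joined, which links names to diagnoses even though neither data set does so by itself.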

Big data analytics vs. compliance

Both of these examples highlight an emerging challenge for the digitized business. Accumulating data has clear benefits for predictive and prescriptive analytics. This has inspired a number of organizations to ingest data sets from external sources to augment the data sets extracted from their internal transactional and operational applications. A number of organizations simultaneously focus on data reclamation, in which unstructured data artifacts such as old emails, documents and slide presentations are retrieved from their respective archives and loaded into the same business data lake.

Data catalog guidelines

Consider these guidelines for developing a data catalog that can be used by company auditors to find regulatory compliance data:

  • Instantiate a metadata repository for capturing the data catalog that supports collaboration across the organization.
  • Acquire tools to automatically survey, profile and categorize data artifacts, as well as index and catalog the data sets based on their semantic meta-tag and data element information.
  • Develop and institute processes for reviewing, documenting and updating data set metadata.
  • Train staff assigned to collecting data set metadata.
  • Establish methods for searching for relevant terms related to compliance requirements.
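As a rough illustration of the second and fifth guidelines, the following Python sketch profiles sample column values against a small set of patterns, tags each column and then searches the resulting catalog by tag. The pattern set, the 80% threshold and every data set and column name are assumptions made for the example; commercial cataloging tools use far richer techniques.

```python
import re

# Hypothetical patterns for two kinds of sensitive values (U.S. formats only).
PATTERNS = {
    "ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
    "us_phone": re.compile(r"^\(?\d{3}\)?[ -]?\d{3}-\d{4}$"),
}


def profile_column(values, threshold=0.8):
    """Return the tags whose pattern matches at least `threshold`
    of the non-empty sample values in a column."""
    values = [v for v in values if v]
    if not values:
        return set()
    tags = set()
    for tag, pattern in PATTERNS.items():
        hits = sum(1 for v in values if pattern.match(v))
        if hits / len(values) >= threshold:
            tags.add(tag)
    return tags


def build_catalog(data_sets):
    """Map each data set name to {column name: set of semantic tags}."""
    return {
        name: {col: profile_column(vals) for col, vals in cols.items()}
        for name, cols in data_sets.items()
    }


def search(catalog, tag):
    """List (data set, column) pairs carrying the given tag."""
    return sorted(
        (name, col)
        for name, cols in catalog.items()
        for col, tags in cols.items()
        if tag in tags
    )


# Made-up sample data standing in for surveyed data artifacts.
data_sets = {
    "hr_export": {"emp_ssn": ["123-45-6789", "987-65-4321"], "desk": ["4A", "7C"]},
    "support_log": {"callback": ["(555) 123-4567", "555-987-6543", ""]},
}
catalog = build_catalog(data_sets)
ssn_columns = search(catalog, "ssn")
```

An auditor looking for Social Security numbers would call `search(catalog, "ssn")` and be pointed at the `emp_ssn` column of `hr_export`, without needing to know in advance that the export exists.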

The uncontrolled, ungoverned growth of these massive data repositories poses regulatory compliance risks. As more data sets are added to the data lake, it becomes difficult to respond rapidly and accurately to a data call from the OFR or to determine which PHI is at risk of exposure.

Proper digital governance of compliance data is a complicated process for the modern business: A company must identify and carefully document the organization's existing data assets, and how the information in those assets maps to known regulatory requirements. But in most cases, few individuals know exactly what data artifacts exist, how the data within them is classified, how the data is accessed, who has the appropriate rights to access the data and which regulations might be affected when those artifacts accumulate in the business data lake. This lack of knowledge about the corporate digital governance environment creates impediments that become apparent the day the call for compliance audit data arrives. Many individuals are left scrambling to figure out which data sets are relevant and how to assemble the information required for the appropriate compliance response.

The rise of automated data cataloging

One approach to address these challenges is to use automated tools that survey and profile each of the organization's data artifacts to assess the type of information they contain. This enables companies to create and subsequently manage a shared semantic catalog. This data catalog raises awareness of what is contained within the various data sets by outlining details such as:

  • Business content, with a high-level overview of the real-world data types contained within the data set such as account numbers, names, locations and other abstract entity concepts.
  • The names of the attributes stored in the data set.
  • Details of any business departments/processes that created, acquired, read or updated the data set.
  • Where the data set is stored and how the information is accessed.
  • The access rights required for reading the data, and which individuals and applications have been assigned rights for accessing the data.
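The details listed above map naturally onto a single catalog record. The sketch below shows one possible shape for such a record in Python; the `CatalogEntry` class, its field names and the sample values are all hypothetical and are not taken from any particular cataloging product.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class CatalogEntry:
    """One hypothetical catalog record, with one field per detail listed above."""
    name: str
    business_content: List[str]       # real-world concepts, e.g. "account numbers"
    attributes: List[str]             # attribute names stored in the data set
    touched_by: Dict[str, List[str]]  # department or process -> actions performed
    location: str                     # where the data set is stored
    access_method: str                # how the information is accessed
    required_right: str               # right needed to read the data
    grants: Dict[str, Set[str]] = field(default_factory=dict)  # principal -> rights

    def may_read(self, principal: str) -> bool:
        """True when the person or application holds the required read right."""
        return self.required_right in self.grants.get(principal, set())

    def readers(self) -> List[str]:
        """All principals currently able to read the data set."""
        return sorted(p for p in self.grants if self.may_read(p))


# Made-up example record.
entry = CatalogEntry(
    name="patient_billing",
    business_content=["account numbers", "names", "locations"],
    attributes=["acct_no", "patient_name", "clinic_city"],
    touched_by={"billing": ["created", "updated"], "analytics": ["read"]},
    location="lake/finance/patient_billing",
    access_method="SQL",
    required_right="read_phi",
    grants={"billing_app": {"read_phi", "write"}, "intern": {"read_public"}},
)
```

With a record like this, `entry.readers()` answers the question of who can see the data, and `entry.may_read("intern")` shows at a glance that a principal without the required right is excluded.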

From a compliance perspective, a data catalog that embodies semantic data awareness helps ensure that proper safeguards are put into place for data protection and privacy compliance (see "Data catalog guidelines"). In addition, an inventory of data artifacts that details what information each one contains can simplify the response to external data calls and so reduce regulatory compliance risk.
