Data Lake or Data Swamp: it’s about governance



Data lakes have become a popular way for today’s financial institutions to store, manage and analyse their rapidly increasing data volumes. First described in 2010[1], data lakes are an innovative technology with great potential to drive business value. However, they have proven challenging to implement and manage successfully in practice, often earning the nickname “data swamps”.
Data lakes provide consolidated storage and querying capability across data of all types, taken from any source. They can be implemented with a range of technologies, but are most closely associated with the Hadoop framework, which supports processing and storage of extremely large data sets in a distributed computing environment. Data lakes are typically created to retain data that "may have value." A "catch-all" for potentially valuable data is often an attractive notion for the C-suite; however, this flexibility carries risks that should first be understood.
Keeping all the data
Developing a traditional data warehouse requires considerable up-front analysis to design a data model that will meet the needs of downstream consumers, and then to build the data flows and transformations that populate it (the "schema-on-write" approach). Decisions on what data to include in the warehouse are made in view of this up-front cost. Generally, data not needed to answer specific business questions or populate defined reports is excluded, as is data which doesn't lend itself to storage in a relational database – web server logs, sensor data, social network activity, images and so on.
In contrast, a data lake can store data of any type, format or structure in a raw form, which is then transformed on an as-needed basis (“schema-on-read”). This lends itself to the practice of storing data which may have value in the future, but where either the uses are not sufficiently understood to inform an efficient data warehouse design, or the business case does not yet justify the effort of doing so.
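To make the contrast concrete, the sketch below assumes a PySpark environment; the paths, field names and schema are purely illustrative. Raw data is landed without any upfront modelling, and structure is imposed only at the point of reading, which is the essence of schema-on-read.

```python
# Illustrative schema-on-read sketch (assumed PySpark environment; paths and
# field names are hypothetical). Raw data is stored untouched, and structure
# is imposed only when a consumer reads it for a specific purpose.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

# 1. Land the feed exactly as delivered -- no target model designed up front.
raw = spark.read.text("s3://lake/raw/payments/2024-06-01/*.json")
raw.write.mode("append").text("s3://lake/raw/payments/archive/2024-06-01/")

# 2. Later, a consumer imposes the structure their use case needs at read time.
payments_schema = StructType([
    StructField("payment_id", StringType()),
    StructField("amount",     DoubleType()),
    StructField("currency",   StringType()),
    StructField("booked_at",  TimestampType()),
])

payments = spark.read.schema(payments_schema).json("s3://lake/raw/payments/archive/2024-06-01/")
payments.createOrReplaceTempView("payments")
spark.sql("SELECT currency, SUM(amount) AS total FROM payments GROUP BY currency").show()
```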
Supporting all users
In most organisations, upwards of 80% of data users are "operational"[2]. Operational users want to generate reports, query key performance metrics or slice the same set of data in a spreadsheet on a regular basis. The data warehouse is for the most part ideal for these traditional business analyst users because it is by nature logically structured and easy to use. The remaining 20% are users with more complex and specific needs, who often venture outside the data warehouse, either back to source systems or out to third-party data sources.
Modern data lake solutions can still serve as conventional data warehouses for operational users, who need access to specific, well-structured datasets to support routine processes such as batch reporting and MI. At the same time, they offer more advanced functionality to analysts who need it, such as data modellers and data scientists exploring a broad range of data from source systems or third-party sources for use cases such as machine learning and AI.
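As a simple illustration of this dual role (the zone paths and table names below are assumptions rather than a prescribed layout), the same lake can feed a routine MI report from a curated zone while leaving the raw zone open for exploratory analysis:

```python
# Hypothetical sketch of one lake serving two audiences; zone paths, table and
# column names are illustrative assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("two-audiences").getOrCreate()

# Operational users: a curated, well-structured zone behaves like a warehouse
# table and feeds routine batch reporting / MI.
curated = spark.read.parquet("s3://lake/curated/trades/")
curated.createOrReplaceTempView("trades")
daily_mi = spark.sql("""
    SELECT trade_date, desk, COUNT(*) AS trade_count, SUM(notional) AS total_notional
    FROM trades
    GROUP BY trade_date, desk
""")
daily_mi.write.mode("overwrite").parquet("s3://lake/reports/daily_trade_mi/")

# Data scientists: the raw zone remains available for exploratory work, for
# example joining curated trades with semi-structured web logs.
raw_logs = spark.read.json("s3://lake/raw/web-logs/")
exploratory = curated.join(raw_logs, curated.client_id == raw_logs.client_id, "left")
```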
Challenges of Data Lakes
From the above, the data lake seems like a clear winner in the evolution of data storage. But the benefits rest on a big assumption: that the users of the data lake have all the information required to make effective use of the datasets it contains. This includes information about their provenance, semantics, and quality or completeness (or lack thereof). Data providers do not include this information by default – they must be compelled to do so by appropriate governance. Without it, data held in the lake is unusable and/or untrusted by downstream consumers. A data lake full of such value-less data is better described as a "data swamp."
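One illustration of the kind of control a governance framework might impose is an ingestion gate that refuses feeds arriving without the metadata downstream consumers rely on. The sketch below is a simplified example; the field names and quality threshold are assumptions, not a standard.

```python
# Hypothetical ingestion gate: a feed is only admitted to the lake when it
# arrives with the metadata consumers need to trust it. Field names and the
# completeness threshold are illustrative assumptions.
REQUIRED_METADATA = {"source_system", "owner", "business_definitions", "lineage", "completeness_pct"}

def admit_to_lake(feed_name: str, metadata: dict) -> bool:
    """Return True if the feed may enter the lake, False otherwise."""
    missing = REQUIRED_METADATA - metadata.keys()
    if missing:
        print(f"Rejecting {feed_name}: missing metadata {sorted(missing)}")
        return False
    if metadata["completeness_pct"] < 95.0:  # illustrative quality bar
        print(f"Rejecting {feed_name}: completeness below threshold")
        return False
    return True

# A feed with no named owner, definitions or lineage is turned away rather
# than silently added to the lake, where it would become untrusted "swamp" data.
admit_to_lake("card_transactions", {"source_system": "CRM", "completeness_pct": 99.2})
```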
Data lakes also carry substantial implementation and compliance risk. By definition, a data lake will accept any data without oversight, so governance is essential to enforce controls on its appropriate use. By default, data lakes operate on a share-everything-with-everyone basis; recent concerns around privacy and ethics have led many organisations to rethink this. GDPR in particular has raised questions over data ownership, retention, deletion and correction.
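To show what acting on those questions might look like in practice, the sketch below implements a simple retention sweep over lake datasets; the dataset names and retention periods are illustrative assumptions, not regulatory guidance.

```python
# Hypothetical retention sweep: datasets holding personal data are flagged for
# deletion once they exceed an agreed retention period. Dataset names, flags
# and retention periods are illustrative assumptions.
from datetime import date, timedelta

RETENTION_POLICY = {
    # dataset            (contains_personal_data, retention period)
    "customer_profiles": (True,  timedelta(days=365 * 7)),
    "web_logs":          (True,  timedelta(days=90)),
    "market_prices":     (False, None),  # no personal data, no forced deletion
}

def datasets_due_for_deletion(last_updated: dict, today: date) -> list:
    """Return datasets whose personal data has exceeded its retention period."""
    due = []
    for name, (personal, retention) in RETENTION_POLICY.items():
        if personal and retention and today - last_updated[name] > retention:
            due.append(name)
    return due

print(datasets_due_for_deletion(
    {"customer_profiles": date(2015, 1, 1), "web_logs": date(2024, 1, 1), "market_prices": date(2024, 5, 1)},
    today=date(2024, 6, 1),
))
```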
Even when a strong governance framework is put in place alongside the implementation of a new data lake, the rigour of that governance often degrades over time. We have observed a similar pattern across a number of clients: strictly enforced practices governing data ingestion (all feeds must have known definitions, lineage, owners and so on) lead to an initial seeding of high-quality data, but over time data of lesser provenance is forced into the lake through dispensation processes as regulatory reporting and change projects take priority.
Conclusion
Data lake technology brings the potential to harness value from "Big Data", but with great power comes great responsibility. Improper governance can lead to a mess of opaque data that cannot be relied upon for regulatory reporting or business processes and is greatly limited in its value to the organisation. With increased regulatory scrutiny on demonstrable data quality, lineage and provenance in financial institutions, it is vital that organisations that choose to invest in this innovative technology do so in conjunction with a strong governance framework.
[1] "A Brief History of Data Lakes", www.dataversity.net, November 8, 2018
[2] "Top Five Differences Between Data Lakes and Data Warehouses", www.blue-granite.com, January 26, 2015