Everyone wants a data hub; few see it pay off. Here’s why.



The idea lies at the heart of nearly every corporate data strategy: “Let’s get all our data in one place. Then we’ll be able to discover hidden game-changing insights by joining previously disparate datasets together. Meanwhile, downstream business consumers will get a single, standardised source of truth for all their analysis and reporting.”
What’s not to like? The concept of a data hub (or store, lake, warehouse, or other preferred terms) is intrinsically appealing and holds an undeniable logic. But it’s not exempt from a mundane but important principle: investments in data infrastructure, like any other, need to pay off. There are many eye-wateringly expensive data hub initiatives, and few which return a *measurable* financial benefit to the organisation.
Most proponents of such projects argue that the payback period from good data management is long (years or decades) and that the benefits are indirect and hard to measure. This is not always enough to sway sceptics, who contend in response that data hubs are an expensive fad – snake oil peddled by Hortonworks, Cloudera and their cloud enablers.
Often, those pushing for a data hub in their organisation don’t *really* know why they want one. This sees them fall victim to technology and operating model choices which ultimately undermine the value from the investment.
Non-reasons for building a data hub
Data hub project business cases promise the following benefits:
- Efficiency savings from architectural simplification (maintaining fewer bespoke point-to-point application interfaces for exchanging data across the enterprise)
- Improved data quality by implementing standard data validation controls and DQ monitoring at the point of ingestion
- Compliance with regulatory mandates to prove lineage back to ‘golden’ sources, e.g. BCBS239
These are intuitive, but dubious. Provider-consumer interface contracts can be standardised, data quality controls can be implemented, and data lineage can be mapped and documented, all without introducing a data hub into the landscape.
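To make that concrete, here is a minimal sketch of a provider-consumer contract with data quality checks applied at the provider's own boundary, with no hub involved. Everything in it is an assumption for illustration: the trade feed, the field names, and the `FieldRule`, `TRADE_CONTRACT` and `validate` names are all hypothetical, and a real contract would likely cover far more than presence and type.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import date


# Hypothetical rule describing one field in a provider-consumer contract.
@dataclass(frozen=True)
class FieldRule:
    name: str
    dtype: type
    required: bool = True


# Illustrative contract for an imagined trade feed (field names are assumptions).
TRADE_CONTRACT = [
    FieldRule("trade_id", str),
    FieldRule("notional", float),
    FieldRule("trade_date", date),
    FieldRule("desk", str, required=False),
]


def validate(record: dict, contract: list[FieldRule]) -> list[str]:
    """Return the data quality violations for one record (empty list if clean)."""
    errors = []
    for rule in contract:
        value = record.get(rule.name)
        if value is None:
            # Absent optional fields are allowed; absent required ones are not.
            if rule.required:
                errors.append(f"missing required field: {rule.name}")
            continue
        if not isinstance(value, rule.dtype):
            errors.append(
                f"{rule.name}: expected {rule.dtype.__name__}, "
                f"got {type(value).__name__}"
            )
    return errors
```

A provider could run a check like this before publishing, and a consumer could run the same contract on receipt. The point is only that the control lives in the interface, not in any central store.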
So – why do it?
Most data hub projects aren’t about the data hub
Real reasons for building a data hub are:
- To actually do big data analytics, which relies on massive datasets being colocated on Hadoop-friendly file storage (evidence of real business value creation from this for financial organisations is only just starting to materialise across the industry)
- To get management to agree to an otherwise-boring package of data management work needed to achieve the benefits cited above. For this, the data hub functions as an internal marketing vehicle for data management: it’s a tangible product, which leaders can point to as evidence of concrete progress in the data agenda.
The latter is a cynical but legitimate motivation behind many data hub projects. The challenge it creates is that if a data hub is conceived as a headline item rather than a solution to the practical problems of the business, it’s often unoptimised for addressing them – and can easily become a hindrance rather than a help.
Bad data hubs
Examples of issues which can see data hubs fail in their mission:
- Big support organisations are built to surround and ‘own’ the data hub, acting as bureaucratic administrators and gatekeepers in any process which needs to interact with it – often in the name of control and compliance. This adds process friction, stifles exploration of the data, and runs up costs to the organisation without generating business value.
- Data hubs are built alongside legacy processes by which data providers publish to consumers outside of the hub, with nothing done to force or incentivise migrations that see the hub put to actual use. Again, the net effect is to increase cost and complexity for the business with no obvious benefit.
- Insufficient effort goes into curating the content within the data hub which is likely to be of high value to the business (if the business even knows what that is!). When a consumer seeks out certain content, the effort of identifying it among all the noise is prohibitive – so they give up.
- The data that consumers want from the data hub simply isn’t there, because the ingestion of data has been prioritised poorly – often based on what is easy for data producers to provide, rather than what is enabling for consumers.
- Data isn’t exposed using tooling and structures which are accessible to consumers. Instead, consumers rely on the support organisation to provide bespoke modelling and reporting on a request basis, when self-service was the whole point!
These issues can all be fixed after the fact. But they’re so much easier and cheaper to address at the architecture and planning stage ahead of any work to implement the data hub. This starts by having a clear understanding and articulation at the outset of how a data hub will be used to solve specific business problems, and what is needed for those use cases to succeed.
Hence, when I hear my clients talking about building a data hub, one of my first questions is “what for?”
The answer tells me a lot about how likely they are – much later – to be glad they did.