Data Lakes – Salvation from Data Integration Complexity?

I’ve been in the IT industry for 20 years and have encountered all too many proposals to solve the enterprise’s data integration nightmare.

In the early ’90s there was a book with an intriguing title. Orfali and Harkey had been publishing a series of books with cartoonish aliens on the cover. One of them, “The Essential Distributed Objects Survival Guide”, apparently won the 1996 software productivity award. In this book you will find wild claims about CORBA 2.0, coined “The Intergalactic Object Bus”. This was client-server technology improved by object-oriented technology, and it was expected to untangle the Gordian knot of enterprise integration.

By the 2000s CORBA was all but forgotten. The World Wide Web had taken over, and a new kind of web technology, Web Services, was created; an entire consortium began conjuring up WS-* standards to support the new paradigm dubbed Service Oriented Architecture (SOA). That set of technologies was subsequently displaced by a much simpler approach called REST, once developers realized that the additional complexity of the WS-* stack was not worth the trouble and reverted to the native web.

CORBA and WS-* SOA both attacked the data integration problem with synchronous remote procedure call technology. In parallel to these developments, however, there existed an alternative approach based on asynchronous message passing: message queueing (MQ), which led to Message Oriented Middleware (MOM) and eventually to the notion of the Enterprise Service Bus (ESB).

In parallel to all this were developments in the database realm, such as Data Warehousing and virtual/federated databases. Over the years many ways to skin the cat were devised, each beginning with point-to-point solutions, then recognizing the benefit of a common canonical/universal mechanism, only to fall into the trap of unmanageable complexity. Going from an N² implementation to an N implementation apparently requires just too much work. Strategies to boil the ocean don’t really work unless that ocean happens to be a pond.
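
To make the N² versus N arithmetic concrete, here is a minimal sketch in Python (the system counts are purely illustrative): point-to-point integration needs a link for every pair of systems, while a canonical hub or bus needs only one adapter per system.

    # Illustrative only: integration effort as a function of N systems.
    def point_to_point_links(n: int) -> int:
        # Point-to-point wiring needs a link for every pair of systems.
        return n * (n - 1) // 2

    def canonical_adapters(n: int) -> int:
        # A shared canonical bus needs one adapter per system.
        return n

    for n in (5, 20, 100):
        print(n, point_to_point_links(n), canonical_adapters(n))
    # 5 systems:    10 links vs   5 adapters
    # 20 systems:  190 links vs  20 adapters
    # 100 systems: 4950 links vs 100 adapters

At twenty systems the difference is already an order of magnitude, which is exactly the gap every canonical approach above promised to close.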

The last three years (Hadoop 1.0 was released in December 2011) have seen the emergence of a new kind of IT system suited for ‘Big Data’. These are massively parallel compute and storage systems designed to tackle extremely large data problems, the kind of problems that made companies like Google and Facebook possible. Could this new kind of technology be the antidote to the ills of enterprise data integration? Big Data systems were not originally designed to tackle the problem of integrating heterogeneous IT systems, and the IT world continues to marvel at the technology and prowess of web-scale companies like Google. Nevertheless, the Data Lake approach provides promising hints that Big Data may in fact be the salvation from unmanaged complexity.
