
Data Lakes and the Responsive 21st Century Corporation

Yammer’s co-founder, Adam Pisoni, is introducing a new way of organizing the corporation. His ideas build on the Lean and Agile methodologies of software development, and in his Responsive Manifesto he makes the case for a new kind of efficiency that will drive the successful workplaces of the future.

From the article “How Yammer’s Co-founder Impressed Bill Gates”:

Flash forward to 2015, when the future is more unpredictable than ever. The connectivity we’ve achieved over the last decade has changed everything. “We moved from a world of information scarcity to a world of information ubiquity,” Pisoni says. Consumers are learning, sharing, adapting — and changing their expectations more rapidly. “The world formed a giant network. And that has accelerated the pace of change to crescendo.”

By breaking down hierarchy and conducting smaller-scale, cheaper experiments, you can dramatically reduce the cost of failure and ultimately make your process both more responsive and more efficient.

The Responsive Manifesto declares the following principles:

  • Purpose over Profit
  • Empowering over Controlling
  • Emergence over Planning
  • Networks over Hierarchies
  • Adaptivity over Efficiency
  • Transparency over Privacy

So let me explain the value of a Data Lake strategy in the context of enabling a more responsive organization:

Empowering over Controlling

The legacy database administration organization limited the control and understanding of data to a few experts. This led to an extremely time-consuming and rigid process that required the continuous participation of the gatekeepers of the data.

Today, circumstances and markets change rapidly as information flows faster. A Data Lake’s self-service capability lets the employees with the best insight and decision-making ability access the company’s data directly. Rather than controlling data through process and hierarchy, you achieve better results by empowering people at the edges.
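
As a rough sketch of what that self-service looks like, assuming PySpark over a lake stored in S3 (the path, dataset, and column names below are hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("self-service").getOrCreate()

# An analyst reads the raw files directly; no DBA ticket, no gatekeeper.
orders = spark.read.json("s3://corp-lake/raw/orders/")  # hypothetical path
orders.createOrReplaceTempView("orders")

# Ad hoc insight with plain SQL over the raw data.
spark.sql("""
    SELECT region, SUM(amount) AS revenue
    FROM orders
    GROUP BY region
    ORDER BY revenue DESC
""").show()
```

The point is organizational rather than technical: the query is written by the person who needs the answer, not routed through a gatekeeper.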

Emergence over Planning

In a highly unpredictable environment, plans start losing value the moment they’re finished. Embracing agile methods that encourage experimentation and fuel rapid learning is a much better investment than spending too much time on upfront planning.

Data Lakes do away with the upfront investment, typical of Data Warehouse deployments, of designing a universal canonical schema. The cost of continuously updating this schema and its corresponding ETL scripts in a rapidly changing environment disappears: data is ingested into a Data Lake in the most automated way possible. This empowers people at the edge to rapidly gain insight from new data.
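
To illustrate this schema-on-read idea, here is a minimal sketch assuming PySpark; the path and the status column are hypothetical. No schema is designed up front, so ingestion can be a fully automated file drop:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

# Ingestion was just landing raw CSV files in the lake; the schema is
# inferred here, at read time, instead of being designed up front.
events = (spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv("hdfs:///lake/raw/clickstream/"))  # hypothetical path

events.printSchema()  # discovered from the data, not declared in advance
events.filter(events["status"] == "error").show()  # 'status' is illustrative
```

When a new field appears upstream, it simply shows up in the inferred schema; there is no canonical schema or ETL script to amend first.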

Networks over Hierarchies

Data Lakes provide the technology and connectivity to increase the ability to self-organize, collaborating more easily across internal and external organizational boundaries. Typical enterprise “Data Silos” are demolished as all data is made available in a single Big Data store.
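
As a sketch of what demolishing a silo means in practice, assuming PySpark and two hypothetical datasets that used to live in separate systems (a CRM and an ERP):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cross-silo").getOrCreate()

# Both datasets now sit side by side in the lake (paths are illustrative).
crm = spark.read.parquet("hdfs:///lake/raw/crm/customers/")
erp = spark.read.parquet("hdfs:///lake/raw/erp/invoices/")

# One join replaces what used to be an inter-departmental data request.
crm.join(erp, on="customer_id").groupBy("segment").count().show()
```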

Adaptivity over Efficiency

A Data Lake is designed for change and continuous learning. Rather than seeking consistency, adaptive systems increase learning and experimentation, in the hopes that one novel idea, product, or method will be the one we need in the new world.

Transparency over Privacy

An enterprise’s data is typically guarded by many different departments, which makes it hard to come by and hard to disseminate across the organization. A Data Lake provides access to data across silos because it is impossible to predict in advance which data might prove useful.

Data Lakes – Salvation from Data Integration Complexity?

I’ve been in the IT industry for 20 years and have encountered all too many proposals to solve the enterprise’s data integration nightmare.

In the early 90’s there was a book with an intriguing title. Orfali and Harkey had been publishing a series of books with cartoonish aliens on the cover. One of them, “The Essential Distributed Objects Survival Guide”, apparently won the 1996 Software Productivity Award. In it you will find wild claims about CORBA 2.0, coined “The Intergalactic Object Bus”: Client Server technology improved by Object Oriented technology. It was expected to untangle the Gordian knot of enterprise integration.

By the 2000’s CORBA was all but forgotten. The World Wide Web had taken over, and a new kind of web technology, Web Services, was created; an entire consortium began conjuring up WS-I standards to support the new paradigm dubbed Service Oriented Architecture (SOA). That set of technologies was subsequently replaced by a much simpler technology called REST, as developers realized that the additional complexity of WS-I was not worth the trouble and reverted to the native web.

CORBA and WS-I SOA are both attempts to solve the data integration problem via synchronous remote procedure calls. In parallel to these developments, however, there existed an alternative approach based on asynchronous message passing: message queueing (MQ), which led to Message Oriented Middleware (MOM), which eventually led to the notion of the Enterprise Service Bus (ESB).

In parallel to all this were developments in the database realm, like Data Warehousing and virtual/federated databases. Over the years many ways to skin the cat were invented, all beginning with point-to-point solutions, then discovering the benefit of a common canonical/universal mechanism, and finally falling into the trap of unmanageable complexity. Going from an N² implementation to an N implementation apparently requires just too much work. Strategies to boil the ocean don’t really work unless that ocean happens to be a pond.
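
The arithmetic behind that N² remark is worth spelling out: connecting N systems point-to-point requires N(N-1)/2 links, while a common canonical mechanism requires only N adapters. A quick illustration in Python:

```python
def point_to_point(n: int) -> int:
    """Pairwise links needed to connect n systems directly."""
    return n * (n - 1) // 2

def canonical(n: int) -> int:
    """Adapters needed when every system speaks one shared format."""
    return n

for n in (5, 20, 100):
    print(f"{n:>3} systems: {point_to_point(n):>4} links vs {canonical(n):>3} adapters")

# Output:
#   5 systems:   10 links vs   5 adapters
#  20 systems:  190 links vs  20 adapters
# 100 systems: 4950 links vs 100 adapters
```

The trouble, as the history above suggests, is that the canonical side of the trade only pays off if the shared mechanism itself stays manageable.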

The last three years (Hadoop 1.0 was released in December 2011) have seen the emergence of a new kind of IT system suited for “Big Data”: massively parallel compute and storage systems designed to tackle extremely large data problems, the kind that made companies like Google and Facebook possible. Could this new kind of technology be the antidote to the ills of enterprise data integration? Big Data systems were not originally designed to tackle the problem of integrating heterogeneous IT systems, and the IT world continues to marvel at the technology and prowess of web-scale companies like Google. Nevertheless, the Data Lake approach provides promising hints that Big Data may in fact be the salvation from unmanaged complexity.

Alluviate – The Origin of the Word

The Merriam-Webster dictionary includes an adjective:

alluvial

adjective al·lu·vi·al \ə-ˈlü-vē-əl\    geology : made up of or found in the materials that are left by the water of rivers, floods, etc.

a noun:

alluvion

noun al·lu·vi·on \ə-ˈlü-vē-ən\

1 : the wash or flow of water against a shore
2 : alluvium
3 : an accession to land by the gradual addition of matter (as by deposit of alluvium) that then belongs to the owner of the land to which it is added; also : the land so added

and the verb:

alluviate

verb al·lu·vi·ate \-vēˌāt, usually -ād + V\
transitive verb
:  to cover with alluvium <an alluviated valley>
intransitive verb
:  to deposit alluvium

We would like to interpret it as the process of discovering nuggets of insight in a large body of data, much like panning for alluvial gold in the old days:
[Image: panning for alluvial gold]