
The Seven I’s of Big Data Science Methodology

The industry press is enamored with the four V’s of Big Data: Volume, Velocity, Variety and Veracity. Volume refers to the size of the data; Velocity to the speed at which data is collected and consumed; Variety to the different kinds of data consumed, from structured data to unstructured data to sensor data; and Veracity to the trustworthiness of the data.

IBM (source) has a nice infographic that highlights the problem space of Big Data:

[Infographic: IBM’s Four V’s of Big Data]

Readers, however, are left perplexed as to how best to discover value given these tremendous data complexities. Alluviate is focused on formulating a Big Data methodology (specifically, we call it the Data Lake Methodology) for creating valuable and actionable insight. There can certainly be other Big Data methodologies, but we believe the Data Lake approach leads to a more agile and lean process.

The current standard process for Data Mining is CRISP-DM. It involves six major phases: business understanding, data understanding, data preparation, modeling, evaluation and deployment.

The core value of Big Data is not just the ability to store lots of data on cheap commodity hardware. The real value, a point many vendors seem to have missed entirely, is the ability to process that data to gain insight. This is Data Mining in the traditional sense, and Machine Learning in the more advanced sense. CRISP-DM is therefore a good starting point for defining a new process. That process, however, needs an upgrade: we can’t ignore the advances in technology since 1999, when CRISP-DM was originally defined.

A Data Lake approach to Big Data has the following features:

  • The ingestion step requires zero ceremony. There is no need for upfront schema design or ETL development (a small schema-on-read sketch follows this list).
  • Access to analytics is democratized by providing easy-to-use, self-service tools.
  • The process is entirely incremental and iterative, rather than the boil-the-ocean approach of earlier data warehousing efforts.
  • The approach employs Big Data processing to create data profiles and provide recommendations that support the discovery of insight.
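
To make the zero-ceremony point concrete, here is a minimal schema-on-read sketch. It assumes PySpark, and the paths and file layouts are hypothetical placeholders rather than anything from a real deployment; the point is simply that raw data lands in the lake and a schema is inferred at read time, with no upfront schema design or ETL.

```python
# Minimal schema-on-read ingestion sketch (assumes PySpark; paths are hypothetical).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lake-ingest").getOrCreate()

# Raw files from several silos land under one namespace in the lake.
# No upfront schema design or ETL: Spark infers a schema when the data is read.
clicks = spark.read.json("/lake/raw/web/clicks/")                           # semi-structured events
orders = spark.read.option("header", "true").csv("/lake/raw/erp/orders/")   # exported tables

clicks.printSchema()   # inspect what was actually ingested
orders.show(5)         # eyeball a few rows before any modelling
```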

Given this context of a new way of doing data integration, I propose the Seven I’s of the Data Lake Methodology:

  1. Ingest
    – Ingest the data sources that you require. Data that comes from multiple data silos is placed in a single namespace to allow ease of exploration.
  2. Integrate
    – Data ingested from multiple sources can gain immense value through data linkage and fusion processes, which usually require a combination of massive processing and machine learning (a toy record-linkage sketch follows this list). Other aspects of integration may include the data preparation required for downstream analytics such as predictive analytics or OLAP cubes. Essentially, we are leveraging Big Data technologies to prepare data so that it is easier for humans to explore and investigate.
  3. Index
    – Data ingested and integrated inside a Data Lake needs to be further processed into structures that enable the user to explore the Data Lake in a responsive manner. Technologies like search engines (inverted indices) and OLAP databases give users the ability to slice and dice through the data; it is not enough to provide a SQL interface onto the Hadoop file system.
  4. Investigate
    – This is the process of exploring the data, building an understanding of it, and creating models from it. This phase is enhanced by the previous Index phase, which accelerates the number of investigative iterations that can be performed (a slice-and-dice sketch follows this list).
  5. Discover Insight
    – When valuable insight is uncovered, the process requires that this insight be validated. The thought here is that the process should lean towards rapidly discovering the Minimum Valuable Insight, analogous to the idea of the Minimum Viable Product.
  6. Invest
    – Insights are nice to have, but they cannot deliver value unless an enterprise invests energy and resources to act on them. This requires that discovered insights be implemented and deployed into the organization.
  7. Iterate
    – This is actually a “meta” phase for the entire process, but I prefer the number seven over the number six. It is important to emphasize that it is not enough to deploy your analytics solution just once. The world is a dynamic place and lots of things change. The original veracity of your sources will drift, and new models will have to be built to compensate and adapt (a small retraining sketch follows this list).
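
To illustrate the Integrate phase, here is a toy record-linkage sketch in plain Python. The records, field names, and similarity threshold are invented for illustration; at Big Data scale this kind of linkage would run as a distributed, often machine-learning-assisted job rather than a nested loop.

```python
# Toy record-linkage sketch: link customer records across two silos.
# The data, fields, and 0.6 threshold are made up for illustration.
from difflib import SequenceMatcher

crm = [{"id": 1, "name": "Acme Corp.", "city": "Boston"},
       {"id": 2, "name": "Globex Inc", "city": "Austin"}]
billing = [{"acct": "A-17", "name": "ACME Corporation", "city": "boston"},
           {"acct": "B-42", "name": "Initech", "city": "Dallas"}]

def normalize(s):
    """Lowercase and strip punctuation so near-duplicates compare cleanly."""
    return "".join(ch for ch in s.lower() if ch.isalnum() or ch.isspace()).strip()

def similarity(a, b):
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

# Link records whose names are similar enough and whose cities agree.
links = []
for c in crm:
    for b in billing:
        if normalize(c["city"]) == normalize(b["city"]) and similarity(c["name"], b["name"]) > 0.6:
            links.append((c["id"], b["acct"]))

print(links)   # [(1, 'A-17')]
```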

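For the Index and Investigate phases, the sketch below again assumes PySpark and hypothetical paths and columns: a curated slice of the lake is exposed behind a queryable view and then sliced and diced interactively. In practice the indexing role is often played by a search engine (inverted index) or an OLAP cube rather than ad-hoc SQL.

```python
# Index + Investigate sketch (assumes PySpark; path and columns are hypothetical).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lake-explore").getOrCreate()

# Index: expose a curated slice of the lake behind a queryable view.
orders = spark.read.parquet("/lake/curated/orders/")
orders.createOrReplaceTempView("orders")

# Investigate: slice and dice revenue by region and month, and keep
# re-asking the question as fast as the underlying structures allow.
spark.sql("""
    SELECT region,
           substr(order_date, 1, 7) AS month,
           SUM(amount)              AS revenue
    FROM orders
    GROUP BY region, substr(order_date, 1, 7)
    ORDER BY revenue DESC
""").show()
```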
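Finally, for the Iterate phase, here is a toy retraining loop using scikit-learn. The data-pull helper and the data itself are invented stand-ins; the point is only that models are periodically refit on fresh data and re-validated before redeployment, instead of being trained once and trusted forever.

```python
# Toy retraining sketch (assumes scikit-learn; data and cadence are invented).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

def latest_window():
    """Stand-in for pulling the most recent labelled data from the lake."""
    X = rng.normal(size=(500, 4))
    y = (X[:, 0] + rng.normal(scale=0.5, size=500) > 0).astype(int)
    return X, y

# Each iteration: refit on fresh data and check that quality still holds
# on a holdout set before redeploying the model.
X, y = latest_window()
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
model = LogisticRegression().fit(X_tr, y_tr)
print("holdout accuracy:", model.score(X_te, y_te))
```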
I hope that the Big Data industry will move beyond discussing the problem of the Four V’s of Big Data. The Seven I’s of Big Data Science instead focus on the more valuable process of discovering actionable insight, and that is a better place to center the discussion.