A Pattern Language for Deep Learning

Pattern Languages are languages derived from entities called patterns that when combined form solutions to complex problems. Each pattern describes a problem and offers solutions. Pattern languages are a way of expressing complex solutions that were derived from experience such that others can gain a better understanding of the solution.

Pattern Languages were originally promoted by Christopher Alexander to describe the architecture of buildings and towns. These ideas were later adopted by Object Oriented Programming (OOP) practitioners to describe the design of OOP programs; these were named Design Patterns. They were extended further into other domains such as SOA and High Scalability.

In the domain of Machine Learning (ML) there is an emerging practice called “Deep Learning”. In ML one encounters many terms such as Artificial Neural Networks, Random Forests, Support Vector Machines and Non-negative Matrix Factorization. These, however, usually refer to a specific kind of algorithm. Deep Learning (DL) is not really one kind of algorithm; rather, it is a whole class of algorithms that tend to exhibit similar ‘patterns’. DL systems are Artificial Neural Networks (ANN) that are constructed with multiple layers (sometimes called Multi-Layer Perceptrons). The idea is not entirely new, since it was first proposed back in the 1960s. However, interest in the domain has exploded with the help of advancing hardware technology (e.g. GPUs). Since 2011, DL systems have been exhibiting impressive results in the field.

The confusion with DL arises when one realizes that there are actually many implementations and that it is not just a single kind of algorithm. There are the conventional Feed-forward Networks (aka Fully Connected Networks), Convolution Networks (ConvNet), Recurrent Neural Networks (RNN) and the less used Restricted Boltzmann Machines (RBM). They all share a common trait: these networks are constructed using a hierarchy of layers. One common pattern, for example, is the use of differentiable layers. This constraint on the construction of DL systems makes it possible to evolve the network incrementally into something that learns classification. Many such patterns have been discovered recently, and it would be very useful for practitioners to have a compilation of these patterns at their disposal. In the next few weeks we will be sharing more details of this Pattern Language.

Pattern languages are an ideal vehicle for describing and understanding Deep Learning. One would like to believe that Deep Learning has a solid foundation based on advanced mathematics. Most academic research papers will conjure up highfalutin math such as path integrals, tensors, Hilbert spaces, measure theory etc., but don’t let the math distract you from the reality that our understanding is minimal. Mathematics, you see, has its inherent limitations. Physical scientists have known this for centuries. We formulate theories in such a way that the structures are mathematically convenient. The Gaussian distribution, for example, is prevalent not because it’s some magical construct that reality has gifted to us. It is prevalent because it is mathematically convenient.

Pattern languages have been leveraged in many fuzzy domains. The original pattern language revolved around the discussion of architecture (i.e. buildings and towns). There are pattern languages that focus on user interfaces, on usability, on interaction design and on software process. None of these has concise mathematical underpinnings, yet we do extract real value from these pattern languages. In fact, the specification of a pattern language is not too far off from the creation of a new algebra in mathematics. Algebras are strictly consistent, but they are purely abstract and need not have any connection with reality. Pattern languages, by contrast, are connected with reality, but their consistency rules are more relaxed. In our attempt to understand the complex world of machine learning (or learning in general) we cannot always leapfrog into mathematics. The reality may be that our current mathematics is woefully incapable of describing what is happening.

Visit for ongoing updates.



Microglia: A Biologically Plausible Basis for Back-Propagation

The Perceptron, the basis of Artificial Neural Networks (ANN), was conceived in 1957. It is, of course, an outdated model of how neurons actually work. Current neural network research and development is driven more by mathematical techniques that ensure continuity and convergence than by anything biologically inspired.

However, there are other research groups, like Numenta and IBM TrueNorth, that are investigating more biologically inspired systems. These systems are referred to as “Spiking Neural Networks (SNN)” or alternatively “Neuromorphic computing”.


These SNNs have unfortunately not proven to be as effective as their less biologically inspired cousins (ANNs). In a recent paper, however, IBM TrueNorth has shown competitive results simulating a Convolution Network. This is direct evidence that an “integrate-and-spike” mechanism has computational capability similar to that of the more proven ANNs. The IBM paper, however, highlighted one major weakness of SNNs. Training of the TrueNorth system required simulation of back-propagation on a separate, conventional GPU:

Training was performed offline on conventional GPUs, using a library of custom training layers built upon functions from the MatConvNet toolbox. Network specification and training complexity using these layers is on par with standard deep learning.


Said in a different way, training relied on “back-propagation”. Biologically inspired SNNs seem to lack a mechanism to receive feedback, although it had previously been conjectured that such a mechanism was not necessary.

Prior to the invention of machine-powered flight, many people could observe what a bird does to fly. They could point out the flapping of wings, the large ratio of wing size to body size, the weight of the bird or the presence of feathers. However, none of these features leads one to the actual mechanics of flight. This is one of the arguments against biologically inspired research. However, birds and planes are both able to fly because the dynamics are the same: the air pressure under a wing is higher than the pressure above it, which creates an upward force.

One commonality between the brain and neural networks is that both are dynamical systems. ANN researchers have observed that if we assign weights to input signals, multiply the signals by the weights and sum up the results, then we have a NN that can perform pretty impressive pattern classification. The discovery of the weights is done through what is called “training”, which adjusts all the weights slightly in a way that reduces the observed error in the pattern classification. Learning is achieved when the observed error settles to a level that is consistently acceptable. This training mechanism is what is called “Back-propagation”.
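
To make the description above concrete, here is a minimal sketch of a single artificial neuron that multiplies inputs by weights, sums them, and is trained by nudging the weights to reduce the error. It is only an illustration: with a single neuron there are no hidden layers to propagate through, and the data set (logical AND), the sigmoid activation, the learning rate and the number of epochs are all assumptions chosen for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, bias, inputs):
    # Multiply each input signal by its weight and sum up the results.
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(total)

def train(samples, epochs=5000, learning_rate=0.5):
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            output = predict(weights, bias, inputs)
            # Derivative of the squared error 0.5 * (output - target)^2
            # with respect to the weighted sum (chain rule through the sigmoid).
            delta = (output - target) * output * (1.0 - output)
            # Adjust every weight slightly in the direction that reduces the error.
            weights = [w - learning_rate * delta * x for w, x in zip(weights, inputs)]
            bias -= learning_rate * delta
    return weights, bias

# Illustrative training data: the logical AND of two inputs.
AND_SAMPLES = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

weights, bias = train(AND_SAMPLES)
predictions = [round(predict(weights, bias, x)) for x, _ in AND_SAMPLES]
print(predictions)
```

In a multi-layer network the same `delta` term is passed backwards from layer to layer, which is where the name back-propagation comes from.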

There are many variants of “back-propagation”. The most common is gradient descent, with a variant called RProp that is an extreme simplification using only the sign of the gradient to perform its update. Natural Gradient based methods are second-order update mechanisms, and an interesting variant called NES employs genetic evolution methods. Field Alignment is another simplistic method that is extremely efficient. In general, back-propagation does not necessarily require that the implementation be a strict application of an analytic gradient calculation. What is essential is that there is some approximation of an appropriate weight update and a corresponding structure to propagate the updates. Incidentally, recent research appears to conclude that the magnitude of the gradient update isn’t as important as the sign of the update.
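
As a rough illustration of a sign-only update, the sketch below minimizes a toy quadratic using just the sign of each gradient component. This is a simplification and not RProp itself, which additionally adapts a separate step size per weight; the objective, the starting point and the step decay schedule are assumptions made for the example.

```python
def gradient(w):
    # Gradient of the illustrative objective f(w) = (w0 - 3)^2 + (w1 + 1)^2.
    return [2.0 * (w[0] - 3.0), 2.0 * (w[1] + 1.0)]

def sign(x):
    # Returns -1, 0 or +1.
    return (x > 0) - (x < 0)

def sign_descent(w, step=1.0, decay=0.9, iterations=100):
    for _ in range(iterations):
        g = gradient(w)
        # Only the sign of each gradient component is used; its magnitude is ignored.
        w = [wi - step * sign(gi) for wi, gi in zip(w, g)]
        # Shrink the step so the iterate settles instead of oscillating forever.
        step *= decay
    return w

w_final = sign_descent([0.0, 0.0])
print(w_final)
```

Even though the update never looks at how large the gradient is, the iterate ends up close to the minimum at (3, -1).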

There has, however, been no biological evidence of a structural mechanism for “back-propagation” in biological brains. Yoshua Bengio published a paper in 2015, “Towards Biologically Plausible Deep Learning”. The investigation attempts to explain how a mechanism for back-propagation could exist in the Spike-Timing-Dependent Plasticity (STDP) of biological neurons. It is, however, questionable whether neurons are able to learn by themselves without an external feedback pathway that spans multiple layers.

There is, however, a recently discovered alternative mechanism that may make a more convincing argument, one based on a structure that is independent of the brain’s neurons. There is a large class of cells in the brain called microglia that are responsible for regulating the neurons and their connectivity.

It turns out that in the absence of chemicals released by glia, the neurons committed the biochemical version of suicide. Barres also showed that the astrocytes appeared to play a crucial role in forming synapses, the microscopic connections between neurons that encode memory. In isolation, neurons were capable of forming the spiny appendages necessary to reach the synapses. But without astrocytes, they were incapable of connecting to one another.

Research on the nature of microglia was largely ignored until very recently:

Hardly anyone believed him. When he was a young faculty member at Stanford in the 1990s, one of his grant applications to the National Institutes of Health was rejected seven times. “Reviewers kept saying, ‘Nah, there’s no way glia could be doing this,’” Barres recalls. “And even after we published two papers in Science showing that [astrocytes] had profound, almost all-or-nothing effects in controlling synapses’ formation or synapse activity, I still couldn’t get funded! I think it’s still hard to get people to think about glia as doing anything active in the nervous system.”

In fact, conventional wisdom was that the lymphatic system did not interface with the brain:

Generations of medical students have been taught the mammalian brain has no connection to the lymphatic system, to help keep the brain isolated and protected from infection. Louveau’s discovery will force a rewrite of anatomy textbooks…

However, recent research has debunked that conventional understanding.


Glia make up 50% of the brain’s volume. This new model of the brain provides a more convincing pathway to explaining “back-propagation” and hints at an explanation for the lack of convincing results from SNN-based systems. SNNs have unfortunately been formulated with only a partial understanding of how the brain works, and are thus an incomplete model that is missing a regulatory mechanism.

In addition, learning does appear to be influenced by the microglia in the brain:

Microglia processes constantly move as they survey the surrounding environment.

Microglia can modify activity-dependent changes in synaptic strength between neurons that underlie memory and learning using classical immunological signaling pathways…

This is further validated in a more recent comprehensive study:

Their dynamism and functional capabilities position them perfectly to regulate individual synapses and to be undoubtedly involved in optimizing information processing, learning and memory, and cognition.

Microglia are even able to communicate with neurons by neurotransmitters:

The presence of specific receptors on the surface of microglia suggests communication with neurons by neurotransmitters.  Here, we demonstrate expression of serotonin receptors, including 5-HT2a,b and 5-HT4 in microglial cells and their functional involvement in the modulation of exosome release by serotonin.

There is, in addition, experimental evidence that sleep, which is also vital in learning, involves the glial cells:

Scientists, who imaged the brains of mice, showed that the glymphatic system became 10-times more active when the mice were asleep.

Back-propagation, perhaps at work, while we sleep?

In summary, biological brains have a regulatory mechanism in the form of microglia that are highly dynamic in regulating synapse connectivity and pruning neural growth. The activity is most pronounced during sleep. SNNs have been shown to have inference capabilities equivalent to Convolution Networks. SNNs, however, have not been shown to learn effectively on their own without a ‘back-propagation’ mechanism. This mechanism is most plausibly provided by the microglia.

The Emerging Information Geometric Approach to Deep Learning

Classical Statistics addresses systems involving large numbers; however, Statistics breaks down in a domain of high-dimensional inputs and models with a large number of parameters. In this domain, new theories and methods are being developed using new insights discovered through the use of massive computational systems. The field of Deep Learning is spearheading these discoveries; however, there is a pressing need for an overarching framework. Such a framework is at the core of our development efforts at Alluviate.

The study of Deep Learning at its foundations is based on Probability Theory and Information Theory. For a probabilistic treatment, the book “The Elements of Statistical Learning” is suggested. From an Information Theoretic viewpoint, David MacKay’s book and his video lectures are a great place to start (see: Information Theory, Inference, and Learning Algorithms). Yoshua Bengio’s upcoming book on Deep Learning also has a chapter dedicated to covering these two fields.

The Count Bayesie blog has a very intuitive tutorial that is worth a quick read. It introduces probability theory and provides a generalization of the equation for expectation :

$$E[X] = \int_{\Omega} X(\omega)P(d\omega) $$

where the author employs the Lebesgue integral, which defines probability over a space that could otherwise be non-Euclidean. This is a hint that probability may not need to be defined in a Euclidean space. If the space is non-Euclidean, then perhaps there are other non-Euclidean metrics that could be employed in the study of Deep Learning?

The dynamics of a Neural Network are usually framed in the context of optimizing a convex or non-convex nonlinear problem. This involves the minimization or maximization of an objective function. The formulation of the objective function is somewhat arbitrary, but it is typically the squared error between the actual and estimated values:

$$ \sum_x [ \hat{q}(x) - q(x) ]^2 $$

The solution to the optimization problem is typically a simple gradient descent approach. What is surprising here is that Deep Learning systems are highly successful despite such a simple approach. One would have thought that gradient descent would all too often get stuck in the many local minima one would expect in a non-convex space. However, the intuition of low dimensions does not carry over to higher dimensions, where most apparent local minima are actually saddle points that a simple gradient descent can escape given enough patience!

However, without an overarching theory or framework, a lot of the techniques employed in Deep Learning (e.g. SGD, Dropout, Normalization, hyper-parameter search etc.) all seem to be arbitrary techniques.

At Alluviate we build on an Information Theoretic approach, with the primary notion of employing metrics that distinguish between an estimated distribution and an actual distribution. We use this knowledge to drive more efficient training.

In Information Theory there is the Kullback-Leibler Divergence $ D_{KL}(p||q) = \sum_x p(x) \log \left( \frac {p(x)}{q(x)} \right) $, which is a measure of the difference between two probability distributions. (Note: Shannon’s Entropy is, up to a sign and an additive constant, the special case of the KL divergence where q is constant.) If one takes a distribution and an infinitesimal perturbation of it, one arrives at the following equation:
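
The definition above translates almost directly into code. The following is a small sketch for discrete distributions given as lists of probabilities; the example distributions `p`, `q` and `uniform` are made up for illustration. It also checks the relation to entropy: when q is constant (uniform over N outcomes), the divergence equals log N minus the entropy of p.

```python
import math

def kl_divergence(p, q):
    # D_KL(p || q) = sum_x p(x) * log(p(x) / q(x)); terms with p(x) = 0 contribute 0.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def entropy(p):
    # Shannon entropy in nats.
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

# Illustrative distributions over three outcomes.
p = [0.5, 0.25, 0.25]
q = [0.8, 0.1, 0.1]
uniform = [1.0 / 3.0] * 3

print(kl_divergence(p, q))        # divergence from p to q
print(kl_divergence(q, p))        # not the same value: KL is not symmetric
print(kl_divergence(p, uniform))  # equals log(3) - entropy(p)
```

Note that the divergence is not symmetric in its arguments, which is one reason it is called a divergence rather than a distance.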

$$ D_{KL}(p_{\theta}||p_{\theta + \delta\theta}) = \frac{1}{2} g_{ij}\delta\theta^{i}\delta\theta^{j} + O(\delta\theta^3) $$

where $ g_{ij} $ is the Fisher Information Matrix (FIM):

$$ g_{ij} = - \sum_x p_{\theta}(x) \frac{\partial}{\partial\theta^i}\frac{\partial}{\partial\theta^j} \log p_{\theta}(x) $$
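
As a sanity check on these two formulas, here is a minimal sketch for the simplest possible case, a Bernoulli distribution with a single parameter $ \theta $ (an assumption made for the example, as are the values of `theta` and `delta`). The Fisher information is computed from the second derivatives of the log-probabilities, and the exact KL divergence between nearby parameters is compared against the standard second-order approximation, which carries a factor of one half in front of the quadratic term.

```python
import math

def bernoulli_kl(a, b):
    # Exact KL divergence between Bernoulli(a) and Bernoulli(b).
    return a * math.log(a / b) + (1.0 - a) * math.log((1.0 - a) / (1.0 - b))

def bernoulli_fisher(theta):
    # g = -sum_x p(x) * d^2/dtheta^2 log p(x), with
    #   d^2/dtheta^2 log(theta)     = -1 / theta^2        (outcome x = 1)
    #   d^2/dtheta^2 log(1 - theta) = -1 / (1 - theta)^2  (outcome x = 0)
    return -(theta * (-1.0 / theta**2) + (1.0 - theta) * (-1.0 / (1.0 - theta)**2))

theta, delta = 0.3, 1e-3
exact = bernoulli_kl(theta, theta + delta)
quadratic = 0.5 * bernoulli_fisher(theta) * delta**2

print(exact, quadratic)
```

The two numbers agree up to a small relative error that shrinks as `delta` gets smaller, which is what the third-order remainder term predicts.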

The Cramér–Rao lower bound is a lower bound on the variance of an unbiased estimator. It is related to the FIM $ I(\theta) $ in scalar form:

$$ \mathrm{Var}( \hat{\theta} ) \geq \frac {1}{I(\theta)} $$

So the above equation shows that the FIM limits how small the variance of an estimator can be: the larger the Fisher information, the lower the achievable variance between estimated and actual values.
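
A small worked example may help here. The sketch below assumes a Bernoulli model observed `n` times independently, in which case the total Fisher information is `n` times that of a single observation and the bound becomes $ 1 / (n I(\theta)) $. It computes the exact variance of the sample mean, an unbiased estimator of $ \theta $, by summing over the binomial distribution, and compares it with the bound. The values of `theta` and `n` are arbitrary choices for illustration.

```python
import math

def fisher_information(theta):
    # Fisher information of a single Bernoulli(theta) observation.
    return 1.0 / (theta * (1.0 - theta))

def variance_of_sample_mean(theta, n):
    # Exact variance of the sample mean k / n, where k is the number of
    # successes in n trials and follows a binomial distribution.
    return sum(
        math.comb(n, k) * theta**k * (1.0 - theta)**(n - k) * (k / n - theta)**2
        for k in range(n + 1)
    )

theta, n = 0.3, 20
bound = 1.0 / (n * fisher_information(theta))
variance = variance_of_sample_mean(theta, n)

print(bound, variance)
```

For this particular estimator the variance coincides with the bound, so no unbiased estimator of a Bernoulli parameter can do better than the sample mean.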

There exists a formulation by Shun-ichi Amari, in a field called Information Geometry, that casts the FIM as a metric. Amari develops this idea in his paper “Natural Gradient Works Efficiently in Learning” and speculates that natural gradient may navigate out of plateaus more effectively than conventional stochastic gradient descent. Information Geometry shares some similarity with Einstein’s General Theory of Relativity in that the dynamics of a system follow a non-Euclidean space. So rather than observing the curvature of light as a consequence of gravity, one would find a curvature of information in the presence of knowledge.
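
To give a feel for what using the FIM as a metric changes, here is a toy sketch that is not taken from Amari’s paper. It fits a single Bernoulli parameter to an observed mean by descending the negative log-likelihood, once with the plain gradient and once with the gradient rescaled by the inverse Fisher information (the natural gradient). The model, the observed mean of 0.7, the starting point of 0.2 and the learning rate of 1.0 are all assumptions chosen for illustration.

```python
def nll_gradient(theta, observed_mean):
    # Derivative of the average negative log-likelihood
    #   -(m * log(theta) + (1 - m) * log(1 - theta))
    # with respect to theta, where m is the observed mean.
    return (theta - observed_mean) / (theta * (1.0 - theta))

def fisher(theta):
    # Fisher information of a Bernoulli(theta) observation.
    return 1.0 / (theta * (1.0 - theta))

def plain_step(theta, observed_mean, learning_rate):
    return theta - learning_rate * nll_gradient(theta, observed_mean)

def natural_step(theta, observed_mean, learning_rate):
    # Rescale the gradient by the inverse of the Fisher information.
    return theta - learning_rate * nll_gradient(theta, observed_mean) / fisher(theta)

start, observed_mean = 0.2, 0.7
theta_plain = plain_step(start, observed_mean, learning_rate=1.0)
theta_natural = natural_step(start, observed_mean, learning_rate=1.0)

print(theta_plain)    # overshoots outside the valid range (0, 1)
print(theta_natural)  # lands on the maximum-likelihood estimate
```

In this one-parameter case the natural gradient step is simply `theta - observed_mean`, so a unit step lands on the maximum-likelihood estimate, while the plain gradient with the same learning rate overshoots. In a real network the FIM is a large matrix, which is why the approximations discussed next matter.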

Although the Information Geometry theory is extremely elegant, the general difficulty with the FIM is that it is expensive to calculate. However, recent developments (all in 2015) have shown various approaches to calculating an approximation that lead to very encouraging results.

In “Parallel training of DNNs with Natural Gradient and Parameter Averaging”, the folks developing the Kaldi speech recognizer describe a stochastic gradient technique that employs an approximation of the FIM. Their technique not only improves over standard SGD, but also allows for parallelization.

Yoshua Bengio and his team at the University of Montreal, in their paper “Topmoumoute online natural gradient algorithm” (TONGA), developed a low-rank approximation of the FIM with an implementation that beats stochastic gradient in speed and generalization.

Finally Google’s DeepMind team have published a paper “Natural Neural Networks“. In this paper they describe a technique that reparameterizes the neural network layers so that the FIM is effectively the identity matrix. It is a novel technique that has similarities to the Batch Normalization that was previously proposed.

We are still at an early stage of a theory of Deep Learning using Information Geometry; however, recent developments seem to show the promise of employing a more abstract theoretical approach. Abstract mathematics like Information Geometry should not be dismissed as impractical to implement, but rather used as a guide towards building better algorithms and analytic techniques. As in High Energy physics research, there is undeniable value in the interplay between the theoretical and the experimental physicists.

For more information on this approach please see: A Pattern Language for Deep Learning.

The Seven I’s of Big Data Science Methodology

Industry press is enamored by the 4 V’s of Big Data. These are Volume, Velocity, Variety and Veracity. Volume referring to the size of the data. Velocity referring to the speed of how data is collected and consumed. Variety referring to the different kinds of data consumed, from structured data, unstructured data and sensor data. Veracity referring to the trustworthiness of the data.

IBM (source) has a nice infographic that highlights the problem space of Big Data:

[Infographic: IBM Big Data Four V's]

Readers, however, are left perplexed as to how best to discover value given these tremendous data complexities. Alluviate is focused on formulating a Big Data Methodology (specifically, we call it The Data Lake Methodology) for creating valuable and actionable insight. There can certainly be other Big Data methodologies, but we believe that the Data Lake approach leads to a more agile and lean process.

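The current standard process for Data Mining is CRISP-DM. It involves six major phases: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation and Deployment.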

The core value of Big Data is not just the ability to store lots of data on cheap commodity hardware. The real value, a point many vendors seem to have missed entirely, is the ability to process that data to gain insight. This is Data Mining in the traditional sense and Machine Learning in the more advanced sense. Therefore, CRISP-DM is a good starting point for defining a new process. This new process, however, needs an upgrade: we can’t ignore advances in technology since 1999, when CRISP-DM was originally defined.

A Data Lake approach to Big Data has the following features:

  • The ingestion step requires zero ceremony. There is no need for upfront schema design or ETL development.
  • Access to analytics is democratized by providing easy-to-use self-service tools.
  • The process is entirely incremental and iterative, rather than the boil-the-ocean approach of previous data warehousing.
  • The approach employs Big Data processing to create data profiles and provide recommendations that support the discovery of insight.

Given this context of a new kind of way of doing data integration, I propose the Seven I’s of the Data Lake Methodology:

  1. Ingest
    – Ingest the data sources that you require. Data that comes from multiple data silos is placed in a single namespace to allow ease of exploration.
  2. Integrate
    – Data that is ingested from multiple sources can gain immense value through data linkage and fusion processes. This usually requires a combination of massive processing and machine learning. Other aspects of integration may include the data preparation required for downstream analytics processes like predictive analytics or OLAP cubes. Essentially, we are leveraging Big Data technologies to prepare data so that it is easier for humans to explore and investigate Big Data.
  3. Index
    – The data ingested and integrated inside a Data Lake needs to be further processed into structures that enable the user to explore the Data Lake in a responsive manner. Technologies like search engines (inverted indices) and OLAP databases provide the capability for users to slice and dice through the data. It is not enough to provide a SQL interface into the Hadoop file system.
  4. Investigate
    – This is the process of exploring the data, building an understanding of it and creating models of it. This phase is enhanced by the previous Index phase, which increases the number of iterations of investigation that can be performed.
  5. Discover Insight
    – When valuable insight is uncovered, the process requires that this insight be validated. The thought here is that the process should lean towards rapidly discovering the Minimum Valuable Insight. This is analogous to the idea of a Minimum Viable Product.
  6. Invest
    – Insights are nice to have, but can’t be made valuable unless an enterprise invests energy and resources to act on them. This requires that the insights discovered are implemented and deployed into the organization.
  7. Iterate
    – This is actually a “meta” phase for the entire process, but I prefer the number seven over the number six. It is of course important to emphasize that it is not enough to deploy your analytics solution just once. The world is a dynamic place and lots of things change. Your world will operate in such a way that the original veracity of your sources will change, and new models will have to be built to compensate and adapt.
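
The Index step above mentions inverted indices as one of the structures that make exploration responsive. As a rough sketch of the idea, and not a description of any particular search engine, the following builds an inverted index over a handful of made-up records and answers a query by intersecting the sets of matching record ids. The record ids, the sample text and the whitespace tokenizer are all hypothetical.

```python
from collections import defaultdict

def build_inverted_index(documents):
    # Map each token to the set of document ids that contain it.
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index

def search(index, query):
    # Return the ids of documents that contain every token of the query.
    tokens = query.lower().split()
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result

# Hypothetical records ingested from different silos.
documents = {
    "crm-1": "Customer complaint about late delivery",
    "erp-7": "Delivery schedule for customer orders",
    "web-3": "Page views by customer segment",
}

index = build_inverted_index(documents)
print(sorted(search(index, "customer delivery")))
```

The index is built once, after which each query only touches the entries for its own tokens rather than scanning every record, which is what makes interactive exploration feasible.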

I hope that the Big Data industry will move beyond discussing the problem of the Four V’s of Big Data. The Seven I’s of Big Data Science instead focuses on the more valuable process of discovering actionable insight. This is a better place to revolve the discussion around.

Data Lakes and the Responsive 21st Century Corporation

Yammer’s founder is introducing a new way of organizing the corporation. His ideas originate from the Lean and Agile methodologies of software development. In his Responsive Manifesto he makes a case for a new kind of efficiency that will drive the successful workplaces of the future.

From the article “How Yammer’s Co-founder Impressed Bill Gates”:

Flash forward to 2015, when the future is more unpredictable than ever. The connectivity we’ve achieved over the last decade has changed everything. “We moved from a world of information scarcity to a world of information ubiquity,” Pisoni says. Consumers are learning, sharing, adapting — and changing their expectations more rapidly. “The world formed a giant network. And that has accelerated the pace of change to crescendo.”

By breaking down hierarchy and conducting smaller-scale, cheaper experiments, you can dramatically reduce the cost of failure and ultimately make your process both more responsive and more efficient.

The Responsive Manifesto declared the following principles:

  • Purpose over Profit
  • Empowering over Controlling
  • Emergence over Planning
  • Networks over Hierarchies
  • Adaptivity over Efficiency
  • Transparency over Privacy

So let me explain the value of a Data Lake strategy in the context of enabling a more responsive organization:

Empowering over Controlling

The legacy database administration organization limited the control and understanding of data to a few experts. This led to an extremely time-consuming and rigid process that required the continuous participation of the gatekeepers of the data.

Today, circumstances and markets change rapidly as information flows faster. A Data Lake’s self-service capability enables employees with the best insight and decision-making ability to easily access data of the company to gain better insight. Rather than controlling data through process and hierarchy, you achieve better results by empowering people at the edges.

Emergence over Planning

In a highly unpredictable environment, plans start losing value the moment they’re finished. Embracing agile methods that encourage experimentation and fuel rapid learning is a much better investment than spending too much time on upfront planning.

In a Data Lake, the upfront investment of designing a universal canonical schema, typically found in Data Warehouse deployments, is done away with. The cost of continuously updating this schema and the corresponding ETL scripts in a rapidly changing environment is removed. Data is ingested into a Data Lake in the most automated way possible. This empowers people at the edge to rapidly gain insight from new data.

Networks over Hierarchies

Data Lakes provide technology and connectivity to increase the ability to self-organize, collaborating more easily across internal and external organizational boundaries. Typical enterprise “Data Silos” are demolished as all data is made available in a single BigData store.

Adaptivity over Efficiency

A Data Lake is designed for change and continuous learning. Rather than seeking consistency, adaptive systems increase learning and experimentation, in the hopes that one novel idea, product, or method will be the one we need in the new world.

Transparency over Privacy

An enterprise has its data guarded by many different organizations. Data is hard to come by and hard to disseminate across the organization. A Data Lake provides access to data across silos because it is impossible to predict which data might be useful.

Data Lakes – Salvation from Data Integration Complexity?

I’ve been in the IT industry for 20 years and have encountered all too many proposals to solve the enterprise’s data integration nightmare.

In the early 90’s there was a book with an intriguing title. Orfali and Harkey had been publishing a series of books that had cartoonish aliens on the cover. They had a book called “The Essential Distributed Objects Survival Guide”, which apparently won the 1996 software productivity award. In this book, you will find wild claims about CORBA 2.0, dubbed “The Intergalactic Object Bus”. This was Client Server technology improved by Object Oriented technology. It was expected that it would untangle the Gordian knot of enterprise integration.

By the 2000’s CORBA was all but forgotten. The World Wide Web had taken over, and a new kind of web technology, Web Services, was created. An entire consortium began conjuring up WS-I standards to support the new paradigm, dubbed Service Oriented Architecture (SOA). That set of technologies was subsequently replaced by a much simpler technology called REST. Developers realized that the additional complexity of WS-I was not worth the trouble and reverted to the native web.

CORBA and WS-I SOA are both solutions to the data integration problem via synchronous remote procedure technology. However, in parallel to these developments, there existed an alternative solution based on asynchronous message passing. This was based on message queueing (MQ), which led to Message Oriented Middleware (MOM), which eventually led to the notion of the Enterprise Service Bus (ESB).

In parallel to all this were developments in the database realm like Data Warehousing and Virtual/Federated Databases. Over the years, many ways to skin the cat were created, all beginning with point-to-point solutions, then realizing the benefit of a common canonical/universal mechanism, and finally falling into the trap of unmanageable complexity. Going from an N^2 implementation to an N implementation apparently requires just too much work. Strategies to boil the ocean don’t really work unless that ocean happens to be a pond.
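
The N^2 versus N remark is simple arithmetic, and a few lines of code make the growth visible. The sketch below counts links under two simplifying assumptions: in the point-to-point case every pair of systems needs its own link, and in the canonical case every system needs exactly one link to a shared hub. The system counts are arbitrary.

```python
def point_to_point_links(n):
    # One link per pair of systems: n * (n - 1) / 2, which grows like N^2.
    return n * (n - 1) // 2

def canonical_hub_links(n):
    # One link from each system to the shared canonical model: grows like N.
    return n

for systems in (5, 10, 50, 100):
    print(systems, point_to_point_links(systems), canonical_hub_links(systems))
```

With 100 systems the pairwise approach needs 4950 links against 100 for the hub, which is the appeal of a canonical mechanism, provided the canonical model itself can be kept manageable.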

The last 3 years (Hadoop’s initial release was in December 2011) have seen the emergence of a new kind of IT system suited for ‘Big Data’. These are massively parallel compute and storage systems designed to tackle extremely large data problems, the kind of large data problems that made companies like Google and Facebook possible. Could this new kind of technology be the antidote to the ills of enterprise data integration? Big Data systems were not originally designed to tackle the problem of integrating heterogeneous IT systems. The IT world continues to marvel at the technology and prowess of web-scale companies like Google. Furthermore, the Data Lake approach provides promising hints that Big Data may in fact be the salvation from unmanaged complexity.

Alluviate – The Origin of the Word

The Merriam Webster dictionary includes an adjective:


adjective al·lu·vi·al \ə-ˈlü-vē-əl\    geology : made up of or found in the materials that are left by the water of rivers, floods, etc.

a noun:


noun al·lu·vi·on \ə-ˈlü-vē-ən\

:  the wash or flow of water against a shore
:  alluvium
:  an accession to land by the gradual addition of matter (as by deposit of alluvium) that then belongs to the owner of the land to which it is added; also:  the land so added.
and the verb:


verb al·lu·vi·ate \-vēˌāt, usually -ād + V\
transitive verb
:  to cover with alluvium <an alluviated valley>
intransitive verb
:  to deposit alluvium

We would like to interpret it as the process of discovering nuggets of insight from a large body of data, much as was done in the old days when prospectors panned for alluvial gold.