## The Emerging Information Geometric Approach to Deep Learning

Classical Statistics addresses systems involving large numbers; however, it breaks down in a domain of high-dimensional inputs and models with a large number of parameters. In this domain, new theories and methods are being developed using insights discovered through the use of massive computational systems. The field of Deep Learning is spearheading these discoveries; however, there is a pressing need for an overarching framework. Such a framework is at the core of our development efforts at Alluviate.

The study of Deep Learning at its foundations is based on Probability Theory and Information Theory. For a probabilistic treatment, the book “The Elements of Statistical Learning” is suggested. From an Information Theoretic viewpoint, David MacKay’s book and his video lectures are a great place to start (see: Information Theory, Inference, and Learning Algorithms). Yoshua Bengio’s upcoming book on Deep Learning also has a chapter dedicated to covering these two fields.

The Count Bayesie blog has a very intuitive tutorial that is worth a quick read. It introduces probability theory and provides a generalization of the equation for expectation:

$$E[X] = \int_{\Omega} X(\omega)P(d\omega) $$

where the author employs the Lebesgue integral, which defines probability over a space that could otherwise be non-Euclidean. This hints that probability need not be defined in a Euclidean space. If the space is non-Euclidean, then perhaps there are other non-Euclidean metrics that could be employed in the study of Deep Learning?
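As a small illustration of the point that the sample space need not be numeric, the sketch below evaluates the expectation over a finite $\Omega$ whose outcomes are plain labels. On a finite space the integral reduces to a sum over outcomes. The weather outcomes, their probabilities, and the random variable `X` are all made up for illustration.

```python
from fractions import Fraction

# Hypothetical sample space: outcomes are labels, not points in R^n.
P = {"sunny": Fraction(1, 2), "cloudy": Fraction(3, 10), "rainy": Fraction(1, 5)}

# Hypothetical random variable X: Omega -> R (umbrellas sold per outcome).
X = {"sunny": 0, "cloudy": 2, "rainy": 10}


def expectation(X, P):
    """E[X] = sum over omega of X(omega) * P({omega}) on a finite sample space."""
    assert sum(P.values()) == 1, "P must be a probability measure"
    return sum(X[omega] * P[omega] for omega in P)


print(expectation(X, P))  # 13/5
```

Only the measure `P` and the mapping `X` matter here; no distance or coordinate structure on the outcomes is ever used.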

The dynamics of a Neural Network are usually framed in the context of optimizing a convex or non-convex nonlinear problem. This involves the minimization or maximization of an objective function. The formulation of the objective function is somewhat arbitrary, but it is typically the squared error between the actual and estimated values:

$$ \sum_x [ \hat{q}(x) - q(x) ]^2 $$

The optimization problem is typically solved with a simple gradient descent approach. What is surprising here is that Deep Learning systems are highly successful despite such a simple approach. One would have thought that gradient descent would all too often get stuck in the many local minima one would expect in a non-convex space. However, the intuition of low dimensions does not carry over to higher dimensions, where many apparent local minima are actually saddle points that simple gradient descent can escape given enough patience!
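The escape from a saddle point can be seen even in a two-dimensional toy problem. The sketch below runs plain gradient descent on the hypothetical function $f(x, y) = x^2 + y^4/4 - y^2/2$, chosen only for illustration: it has a saddle at the origin and minima at $(0, \pm 1)$. The starting point sits almost exactly on the ridge leading into the saddle, so progress is slow at first, but the tiny offset in $y$ is eventually amplified.

```python
def grad(x, y):
    """Gradient of the toy function f(x, y) = x**2 + y**4 / 4 - y**2 / 2."""
    return 2.0 * x, y**3 - y


def gradient_descent(x, y, lr=0.1, steps=500):
    """Plain gradient descent with a fixed learning rate."""
    for _ in range(steps):
        gx, gy = grad(x, y)
        x, y = x - lr * gx, y - lr * gy
    return x, y


# Start almost on the line y = 0, which leads straight into the saddle at (0, 0).
x_final, y_final = gradient_descent(1.0, 1e-6)
print(x_final, y_final)  # ends near the minimum at (0, 1), not at the saddle
```

Starting at exactly `y = 0` would converge to the saddle itself; any small perturbation, such as the noise in stochastic gradient descent, is enough to break that symmetry.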

However, without an overarching theory or framework, many of the techniques employed in Deep Learning (e.g. SGD, Dropout, Normalization, hyper-parameter search, etc.) seem to be arbitrary (see: http://arxiv.org/abs/1206.5533 ).

At Alluviate we build on an Information Theoretic approach, with the primary notion of employing metrics that distinguish between an estimated distribution and an actual distribution. We use this knowledge to drive more efficient training.

In Information Theory there is the Kullback-Leibler Divergence $ D_{KL}(p||q) = \sum_x p(x) \log \left( \frac {p(x)}{q(x)} \right) $, which is a measure of the difference between two probability distributions. (Note: Shannon’s Entropy is closely related; when q is the uniform distribution, the KL divergence equals a constant minus the entropy of p.) If one takes a distribution and an infinitesimally perturbed copy of it, one arrives at the following equation:
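A minimal sketch of the divergence for discrete distributions is given below. The distributions `p` and `q` are arbitrary examples. The code also checks two properties worth keeping in mind: the divergence is not symmetric in its arguments, and against a uniform distribution over $n$ outcomes it reduces to $\log n - H(p)$, where $H$ is Shannon's entropy.

```python
import math


def kl_divergence(p, q):
    """D_KL(p || q) in nats for discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def entropy(p):
    """Shannon entropy H(p) in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


# Arbitrary example distributions over three outcomes.
p = [0.5, 0.3, 0.2]
q = [0.1, 0.6, 0.3]
uniform = [1 / 3] * 3

print(kl_divergence(p, q), kl_divergence(q, p))  # the two directions differ
print(kl_divergence(p, uniform), math.log(3) - entropy(p))  # these agree
```

Because of the asymmetry, the KL divergence is not a distance in the usual sense, which is part of why its local, infinitesimal form is the more natural object to treat geometrically.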

$$ D_{KL}(p_{\theta}||p_{\theta + \delta\theta}) = \frac{1}{2} g_{ij}\,\delta\theta^{i}\delta\theta^{j} + O(\|\delta\theta\|^3) $$

where repeated indices are summed over and $ g_{ij} $ is the Fisher Information Matrix (FIM):

$$ g_{ij} = - \sum_x p_{\theta}(x) \frac{\partial}{\partial\theta^i}\frac{\partial}{\partial\theta^j} \log p_{\theta}(x) $$
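The relationship between the KL divergence and the FIM can be checked numerically on the simplest possible model. The sketch below assumes a Bernoulli distribution with a single parameter $\theta$, so the FIM is a scalar, and evaluates the sum in the definition above directly. With the FIM defined this way, the quadratic term of the expansion carries a factor of $\frac{1}{2}$, so the code compares the divergence against $\frac{1}{2} g\,\delta\theta^2$. The values of `theta` and `delta` are arbitrary.

```python
import math


def bernoulli_pmf(x, theta):
    """p_theta(x) for x in {0, 1}."""
    return theta if x == 1 else 1.0 - theta


def d2_log_pmf(x, theta):
    """Second derivative of log p_theta(x) with respect to theta."""
    return -1.0 / theta**2 if x == 1 else -1.0 / (1.0 - theta) ** 2


def fisher_information(theta):
    """g = -sum_x p_theta(x) * d^2/dtheta^2 log p_theta(x)."""
    return -sum(bernoulli_pmf(x, theta) * d2_log_pmf(x, theta) for x in (0, 1))


def kl_bernoulli(theta, other):
    """D_KL(p_theta || p_other) for two Bernoulli distributions."""
    return sum(
        bernoulli_pmf(x, theta)
        * math.log(bernoulli_pmf(x, theta) / bernoulli_pmf(x, other))
        for x in (0, 1)
    )


theta, delta = 0.3, 1e-3
g = fisher_information(theta)            # closed form: 1 / (theta * (1 - theta))
kl = kl_bernoulli(theta, theta + delta)  # divergence to a nearby distribution
quadratic = 0.5 * g * delta**2           # second-order approximation
print(g, kl, quadratic)
```

The two quantities agree up to the third-order remainder, which shrinks faster than the quadratic term as `delta` is reduced.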

The Cramér–Rao lower bound is a lower bound on the variance of an unbiased estimator. It is related to the FIM $ I(\theta) $, shown here in scalar form:

$$ \mathrm{Var}( \hat{\theta} ) \geq \frac {1}{I(\theta)} $$

So the above inequality shows that the FIM limits how small the variance of an estimate around the actual value can be: the larger the Fisher information, the lower the achievable variance.
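As a worked check of the bound, the sketch below again assumes a Bernoulli model, where the Fisher information of one observation is $1/(\theta(1-\theta))$. For $n$ independent observations the information adds up, so the bound becomes $1/(n I(\theta))$. The sample mean is an unbiased estimator of $\theta$, and its exact variance, computed here by summing over the binomial distribution of the success count, meets the bound with equality. The values of `theta` and `n` are arbitrary.

```python
import math


def fisher_information(theta):
    """Fisher information of a single Bernoulli(theta) observation."""
    return 1.0 / (theta * (1.0 - theta))


def variance_of_sample_mean(theta, n):
    """Exact variance of the mean of n Bernoulli(theta) draws.

    Sums (k / n - theta)**2 weighted by the binomial probability of k successes.
    """
    return sum(
        math.comb(n, k) * theta**k * (1.0 - theta) ** (n - k) * (k / n - theta) ** 2
        for k in range(n + 1)
    )


theta, n = 0.3, 50
bound = 1.0 / (n * fisher_information(theta))  # Cramér–Rao bound for n samples
variance = variance_of_sample_mean(theta, n)
print(bound, variance)  # the sample mean attains the bound
```

Most estimators do not attain the bound exactly; this case is a convenient one where the inequality is tight.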

There exists a formulation by Shun-ichi Amari in a field called Information Geometry that casts the FIM as a metric. In his paper “Natural Gradient Works Efficiently in Learning”, Amari speculates that natural gradient may more effectively navigate out of plateaus than conventional stochastic gradient descent. Information Geometry shares some similarity with Einstein’s General Theory of Relativity in that the dynamics of a system unfold in a non-Euclidean space. So rather than observing the curvature of light as a consequence of gravity, one would find a curvature of information in the presence of knowledge.
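The plateau behaviour can be illustrated with a one-parameter toy model. This is only a sketch under stated assumptions, not the algorithm from Amari's paper. The model is a Bernoulli distribution with probability `sigmoid(eta)`, the loss is the squared error against a made-up target of 0.7, and the starting value `eta = -8` lies deep in the flat region of the sigmoid. For this model the FIM with respect to `eta` is the scalar `s * (1 - s)`, so the natural gradient is simply the ordinary gradient divided by it.

```python
import math


def sigmoid(eta):
    return 1.0 / (1.0 + math.exp(-eta))


def train(eta, target, lr=0.5, steps=100, natural=False):
    """Minimise (sigmoid(eta) - target)**2 over the logit eta."""
    for _ in range(steps):
        s = sigmoid(eta)
        grad = 2.0 * (s - target) * s * (1.0 - s)  # ordinary gradient w.r.t. eta
        if natural:
            fisher = s * (1.0 - s)  # FIM of Bernoulli(sigmoid(eta)) w.r.t. eta
            grad = grad / fisher    # natural gradient: inverse FIM times gradient
        eta -= lr * grad
    return eta


eta_plain = train(-8.0, target=0.7)
eta_natural = train(-8.0, target=0.7, natural=True)
print(sigmoid(eta_plain), sigmoid(eta_natural))
```

With the same learning rate and step count, the ordinary gradient barely moves off the plateau, while the natural gradient reaches the target. In a real network the FIM is a large matrix that has to be approximated and inverted, and a damping term is usually needed where it becomes close to singular.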

Although the Information Geometry theory is extremely elegant, the general difficulty with the FIM is that it is expensive to calculate. However, recent developments (most of them from 2015) have shown various approaches to calculating an approximation that lead to very encouraging results.

In “Parallel training of DNNs with Natural Gradient and Parameter Averaging”, the folks developing the Kaldi speech recognizer describe a stochastic gradient technique that employs an approximation of the FIM. Their technique not only improves over standard SGD, but also allows for parallelization.

Yoshua Bengio and his team at the University of Montreal, in their paper “Topmoumoute online natural gradient algorithm” (TONGA), have developed a low-rank approximation of the FIM with an implementation that beats stochastic gradient in speed and generalization.

Finally, Google’s DeepMind team has published a paper, “Natural Neural Networks”. In this paper they describe a technique that reparameterizes the neural network layers so that the FIM is effectively the identity matrix. It is a novel technique that has similarities to the previously proposed Batch Normalization.

We are still at an early stage for a theory of Deep Learning using Information Geometry; however, recent developments seem to show the promise of employing a more abstract theoretical approach. Abstract mathematics like Information Geometry should not be dismissed as impractical to implement, but rather used as a guide towards building better algorithms and analytic techniques. As in High Energy physics research, there is undeniable value in the interplay between the theoretical and the experimental physicists.

For more information on this approach please see: A Pattern Language for Deep Learning.