# Deep Learning – The Unreasonable Effectiveness of Randomness

The paper submission deadline for ICLR 2017 in Toulon, France has arrived, and instead of a trickle of new knowledge about Deep Learning we get a massive deluge. This is a gold mine of research that’s hot off the presses. Many papers are incremental improvements on state-of-the-art algorithms. I had hoped to find more fundamental theoretical and experimental results on the nature of Deep Learning; unfortunately, there were just a few. There were, however, two developments that were mind-boggling, and one paper that confirms, with shocking results, something I’ve been suspecting for a while now. It really is a good news, bad news story.

First, let’s talk about the good news. The first is the mind-boggling discovery that you can train a neural network to learn to learn (i.e. meta-learning). More specifically, several research groups have trained neural networks to perform stochastic gradient descent (SGD). Not only have they been able to demonstrate neural networks that have learned SGD, the networks have performed better than any hand-tuned method designed by humans! The two papers that were submitted were “Deep Reinforcement Learning for Accelerating the Convergence Rate” and “Optimization as a Model for Few-Shot Learning”. Unfortunately though, these two groups had already been scooped by DeepMind, who showed that you could do this in the paper “Learning to learn by gradient descent by gradient descent”. The two latter papers trained an LSTM, while the first one trained via RL. I had thought that it would take a bit longer to implement meta-learning, but it has arrived much sooner than I had expected!
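To make the “learning to learn” idea a little more concrete, here is a toy sketch. It is not the method of any of the papers above: instead of an LSTM or an RL agent, the “learned optimizer” has a single parameter (a step size), and it is meta-trained by gradient descent on the loss left over after a short inner training run. The task family (`TASKS`), the function names, and all constants are assumptions made up for illustration.

```python
# Toy "gradient descent by gradient descent": the optimizer's only
# parameter is a step size, and we tune it with an outer gradient descent.
# Everything here (tasks, names, constants) is a hypothetical illustration.

TASKS = (1.0, 2.0, 3.0)  # curvatures a of toy losses f(x) = 0.5 * a * x**2


def inner_train(step_size, curvature, steps=5, x0=1.0):
    """Run plain gradient descent on one toy task; return the final loss."""
    x = x0
    for _ in range(steps):
        grad = curvature * x          # derivative of 0.5 * a * x**2
        x -= step_size * grad
    return 0.5 * curvature * x * x


def meta_loss(step_size):
    """Average post-training loss over the task family."""
    return sum(inner_train(step_size, a) for a in TASKS) / len(TASKS)


def meta_train(step_size, meta_lr=1e-3, meta_steps=200, eps=1e-5):
    """Outer loop: descend the meta-loss using a finite-difference gradient."""
    for _ in range(meta_steps):
        meta_grad = (meta_loss(step_size + eps) - meta_loss(step_size - eps)) / (2 * eps)
        step_size -= meta_lr * meta_grad
    return step_size


initial = 0.01
learned = meta_train(initial)
print("step size:", initial, "->", learned)
print("meta-loss:", meta_loss(initial), "->", meta_loss(learned))
```

The outer loop never sees a hand-written rule for picking the step size; it only sees how well the inner optimizer did. Replace the single number with a network that maps gradients to updates and you have the general shape of the idea.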

Not to be outdone, two other groups created machines that could design new Deep Learning networks and do it in such a way as to improve on the state of the art! This is learning to design neural networks. The two papers that were submitted are “Designing Neural Network Architectures using Reinforcement Learning” and “Neural Architecture Search with Reinforcement Learning”. The former paper describes the use of reinforcement Q-learning to discover CNN architectures. You can find some of their generated CNNs in Caffe here: https://bowenbaker.github.io/metaqnn/ . The latter paper is truly astounding (you can’t do this without Google’s compute resources). Not only did they show state-of-the-art CNN networks, the machine actually learned a few more variants of the LSTM node! Here are the LSTM nodes the machine created (left and bottom of the figure):

[Figure: LSTM nodes created by the machine]

So not only are researchers who hand-optimize gradient descent solutions out of business, so are folks who make a living designing neural architectures! This is actually just the beginning of Deep Learning systems bootstrapping themselves. So I must now share Schmidhuber’s cartoon that aptly describes what is happening:

[Figure: Schmidhuber’s cartoon]

This is absolutely shocking, and there’s really no end in sight as to how quickly Deep Learning algorithms are going to improve. This meta capability can be applied to itself, recursively creating better and better systems.

Permit me now to deal you the bad news. Here is the paper that is the bearer of that news: “Understanding Deep Learning Requires Rethinking Generalization”. I’ve thought about Generalization a lot, and I’ve posted some questions on Quora about Generalization and also about Randomness in the hope that someone could give some good insight. Unfortunately, nobody had enough of an answer or understood the significance of the question until the folks who wrote the above paper performed some interesting experiments. Here is a snippet of what they found:

1. The effective capacity of neural networks is large enough for a brute-force memorization of the entire data set.

2. Even optimization on random labels remains easy. In fact, training time increases only by a small constant factor compared with training on the true labels.

3. Randomizing labels is solely a data transformation, leaving all other properties of the learning problem unchanged.
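The flavor of that experiment can be reproduced in miniature. The sketch below is not the paper’s setup: it swaps the deep network for a plain perceptron and the image data for random Gaussian vectors (both assumptions for illustration), but it keeps the key ingredient of many more parameters than training examples. The same training loop is run once on “true” labels that depend on the input and once on coin-flip labels.

```python
import random


def train_perceptron(points, labels, max_epochs=1000):
    """Classic perceptron: update on every mistake until an epoch has none."""
    weights = [0.0] * len(points[0])
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(points, labels):
            score = sum(w * xi for w, xi in zip(weights, x))
            if y * score <= 0:  # misclassified, or sitting on the boundary
                weights = [w + y * xi for w, xi in zip(weights, x)]
                mistakes += 1
        if mistakes == 0:
            break
    return weights


def accuracy(weights, points, labels):
    """Fraction of points whose predicted sign matches the label."""
    correct = 0
    for x, y in zip(points, labels):
        score = sum(w * xi for w, xi in zip(weights, x))
        if y * score > 0:
            correct += 1
    return correct / len(points)


rng = random.Random(0)
NUM_POINTS, NUM_FEATURES = 20, 100  # far more weights than examples
points = [[rng.gauss(0, 1) for _ in range(NUM_FEATURES)] for _ in range(NUM_POINTS)]

true_labels = [1 if x[0] > 0 else -1 for x in points]   # depend on the input
random_labels = [rng.choice((-1, 1)) for _ in points]   # pure coin flips

acc_true = accuracy(train_perceptron(points, true_labels), points, true_labels)
acc_random = accuracy(train_perceptron(points, random_labels), points, random_labels)
print("training accuracy, true labels:  ", acc_true)
print("training accuracy, random labels:", acc_random)
```

With this much spare capacity, the model should fit the coin-flip labels just as perfectly as the meaningful ones, so training accuracy alone tells you nothing about whether anything was learned.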

The shocking truth is revealed: Deep Learning networks are just massive associative memory stores! Those high-dimensional manifolds that are supposedly created appear to be figments of our imagination! That’s why adversarial features work: there aren’t any continuously navigable manifolds! It is kind of like when we discovered the nucleus and the electron and found that there’s mostly empty space between them. Atoms aren’t hard spheres!

John Hopcroft wrote a paper earlier this year examining the duality of Neural Networks and Associative Memory. Here’s a figure from his paper:

[Figure: from Hopcroft’s paper]

The “Rethinking Generalization” paper goes even further by examining our tried and true tool for achieving Generalization (i.e. Regularization) and finds that:

Explicit regularization may improve generalization performance, but is neither necessary nor by itself sufficient for controlling generalization error.

In other words, all our regularization tools may be less effective than we believe! Furthermore, and even more shocking, the unreasonable effectiveness of SGD turns out to be:

Appealing to linear models, we analyze how SGD acts as an implicit regularizer.

just a different kind of regularization that just happens to work!
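To see what implicit regularization looks like in the linear case, here is a minimal sketch built on a standard fact about least squares: when there are more weights than examples, infinitely many weight vectors fit the data exactly, yet SGD started from zero only ever moves along the data vectors and so lands on the fit with the smallest norm. The two data points, the step size, and the names below are assumptions chosen so the answer can be checked by hand.

```python
import random

# Two examples, three weights: an underdetermined linear regression.
X = [(1.0, 0.0, 1.0), (0.0, 1.0, 1.0)]
Y = [2.0, 2.0]

# Both of these fit the data exactly; MIN_NORM is X^T (X X^T)^{-1} y,
# worked out by hand for this tiny problem.
MIN_NORM = (2.0 / 3.0, 2.0 / 3.0, 4.0 / 3.0)
OTHER_FIT = (2.0, 2.0, 0.0)


def sgd_from_zero(steps=5000, lr=0.1, seed=0):
    """SGD on the squared error, one randomly chosen example per step."""
    rng = random.Random(seed)
    w = [0.0, 0.0, 0.0]
    for _ in range(steps):
        i = rng.randrange(len(X))
        residual = sum(wj * xj for wj, xj in zip(w, X[i])) - Y[i]
        # Gradient of 0.5 * residual**2 is residual * x_i, so every update
        # is a multiple of a data vector.
        w = [wj - lr * residual * xj for wj, xj in zip(w, X[i])]
    return w


w = sgd_from_zero()
print("SGD solution:     ", w)
print("min-norm solution:", MIN_NORM)
print("another exact fit:", OTHER_FIT)
```

Nothing in the loss asks for a small-norm solution; the preference comes entirely from the starting point and the update rule. That is regularization you never wrote down.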

In fact, a paper submitted to ICLR 2017 by another group, titled “An Empirical Analysis of Deep Network Loss Surfaces”, confirms that the local minima of these networks are different:

Our experiments show that different optimization methods find different minima, even when switching from one method to another very late during training. In addition, we found that the minima found by different optimization methods have different shapes, but that these minima are similar in terms of the most important measure, generalization accuracy.

This tells you that your choice of learning algorithm “rigs” how it arrives at a solution. Randomness is ubiquitous, and it does not matter how you regularize your network or which SGD variant you employ: the network just seems to evolve (if you set the right random conditions) towards convergence! What are the properties of SGD that lead to machines that can learn? Are the properties tied to differentiation, or is it something more general? If we can teach a network to perform SGD, can we teach it to perform this unknown generalized learning method?

The effectiveness of this randomness was in fact demonstrated earlier this year in the paper “A Powerful Generative Model Using Random Weights for the Deep Image Representation”, also co-authored by John Hopcroft, which showed that you could generate realistic imagery using randomly initialized networks without any training! How could this be possible?
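One small piece of intuition, offered as a sketch and not as the paper’s method: even a single untrained layer with random Gaussian weights tends to preserve the geometry of its inputs. The example below uses a plain linear layer with no nonlinearity and random vectors in place of images (all assumptions for illustration) and compares pairwise distances before and after the layer.

```python
import math
import random


def random_layer(in_dim, out_dim, rng):
    """Untrained weight matrix: i.i.d. Gaussian entries scaled by 1/sqrt(out_dim)."""
    scale = 1.0 / math.sqrt(out_dim)
    return [[rng.gauss(0, 1) * scale for _ in range(in_dim)] for _ in range(out_dim)]


def apply_layer(layer, x):
    """Matrix-vector product, one output per row of weights."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in layer]


def distance(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


rng = random.Random(0)
IN_DIM, OUT_DIM, NUM_INPUTS = 50, 400, 5

layer = random_layer(IN_DIM, OUT_DIM, rng)
inputs = [[rng.gauss(0, 1) for _ in range(IN_DIM)] for _ in range(NUM_INPUTS)]
outputs = [apply_layer(layer, x) for x in inputs]

# Ratio of distance after the layer to distance before it, for every pair.
ratios = [
    distance(outputs[i], outputs[j]) / distance(inputs[i], inputs[j])
    for i in range(NUM_INPUTS)
    for j in range(i + 1, NUM_INPUTS)
]
print("distance ratios:", [round(r, 3) for r in ratios])
```

The ratios should all sit close to 1: inputs that were near each other stay near, and inputs that were far apart stay far. A random layer has not learned anything, but it has not thrown the structure of the data away either.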

Therefore, to understand Deep Learning, we must embrace randomness. Randomness arises from maximum entropy, which interestingly enough is not without its own structure! The strangeness here is that Randomness is ubiquitous in the universe. The arrow of time is reflected by the direction towards greater entropy. How then is it that this property is also the basis of learning machines?

Please see Design Patterns for Deep Learning: Canonical Patterns for additional insight on this intriguing subject.