Three Architecture Ilities found in Deep Learning Systems

People in software development are familiar with the phrase “-ilities”. It is not actually a word, but you can Google it:

Informally these are sometimes called the “ilities”, from attributes like stability and portability. Qualities — that is non-functional requirements — can be divided into two main categories: Execution qualities, such as security and usability, which are observable at run time.

Non-functional requirement — Wikipedia

You may think that because it is not a real word, it is just an informal convention, some kind of loose jargon. This is actually not the case; software quality attributes have in fact been formalized in ISO 9126:

Quality attributes are realized non-functional requirements used to evaluate the performance of a system. They are informally called “ilities” after the suffix that many of the words share. In software architecture, these “ilities” are the qualities that matter when evaluating our solutions. What the DL literature lacks is a comparable understanding of how to evaluate the quality of a Deep Learning architecture. What then are the “ilities” that are specific to evaluating Deep Learning systems?

Despite the newness of the field, there are three main “ilities” that a practitioner should know:

Expressibility — This quality describes how well a machine can approximate functions. One of the first questions that many research papers have tried to answer is “Why does a Deep Learning system need to be deep?” In other words, what is the importance of having multiple layers, or a hierarchy of layers? There is some consensus in the literature that deeper networks require fewer parameters than shallow, wider networks to express the same function. You can find more detail on the various explanations here: http://www.deeplearningpatterns.com/doku.php/hierarchical_abstraction. The measure here appears to be how few parameters (i.e. weights) we need to effectively create a function approximator. A related research area is weight quantization: how few bits does one need without losing precision? (A small parameter-count sketch follows these three descriptions.)

Trainability — The other kind of research that gets published is on how well a machine can learn. You will find hundreds of papers that all try to outdo each other by showing how trainable their system is compared to the ‘state-of-the-art’. The open theoretical question here is: why do these systems learn at all? The reason this is not obvious is that the workhorse of Deep Learning, the stochastic gradient descent (SGD) algorithm, appears far too simplistic to possibly work. There is a conceptual missing link here that researchers have yet to identify. (A minimal SGD sketch also follows below.)

Generalizability — This quality describes how well a trained machine performs predictions on data that it has not seen before. I’ve written about this in more detail in “Rethinking Generalization”, where I describe five ways to measure generalization. Everyone seems to talk about generalization; unfortunately, few have a good handle on how to measure it.
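
To make the first two ilities a bit more concrete, here are two minimal sketches of my own; the layer sizes, data and hyperparameters below are arbitrary illustrations, not taken from any of the papers discussed. The first shows the parameter-count argument for expressibility: a deep, narrow fully connected network can use far fewer weights than a shallow, wide one.

```python
# Hypothetical layer sizes for illustration only.
def mlp_param_count(layer_sizes):
    """Count weights and biases of a fully connected network."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

deep_narrow = [784, 128, 128, 128, 10]   # three hidden layers of 128 units
shallow_wide = [784, 2048, 10]           # one hidden layer of 2048 units

print("deep/narrow parameters:  ", mlp_param_count(deep_narrow))   # ~135k
print("shallow/wide parameters: ", mlp_param_count(shallow_wide))  # ~1.6M
```

The second is the trainability workhorse itself: plain stochastic gradient descent fitting a one-variable linear model on synthetic data.

```python
# A toy SGD loop on made-up data; the "true" parameters are w=3.0, b=0.5.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=200)
y = 3.0 * x + 0.5 + rng.normal(scale=0.1, size=200)

w, b, lr = 0.0, 0.0, 0.1
for epoch in range(20):
    for i in rng.permutation(len(x)):      # stochastic: one sample at a time
        err = (w * x[i] + b) - y[i]
        w -= lr * err * x[i]               # gradient of 0.5 * err**2 w.r.t. w
        b -= lr * err                      # gradient of 0.5 * err**2 w.r.t. b
print(w, b)                                # should end up close to 3.0 and 0.5
```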

In computer science, we do understand expressibility. In its most general form, it is the notion of “Turing Completeness” or “Universal Computation” (see “Simplicity of Universal Machines”). Feed-forward networks and convolution networks, for example, are not Turing complete simply because they don’t have memory. What Deep Learning brings to the table that is radically different from conventional computer science is the latter two capabilities.

Trainability, the ability to train a computer rather than program it, is a major capability. This is “automating automation”. In other words, you don’t need to provide specific, detailed instructions; you just need to provide the machine with examples of what it needs to do. We’ve actually seen something like this before in the difference between imperative and declarative programming. The difference in Deep Learning (or Machine Learning), however, is that we don’t need to define the rules. The machine is able to discover the rules for itself.

Even better, generalization implies that once the machine is trained, when it encounters a situation it has never been shown an example of before, it is able to figure out how to make the correct prediction. Generalization implies that even after discovering rules through training, the machine can create new rules on its own for unexpected situations. The machine has become more adaptable.

These ilities tie in with the “Five Capability Levels of Deep Learning Systems”. At each level we can explore the expressibility, trainability and generalizability required to achieve that level. As an example, we can look at machines at the Classification with Memory level. What does the additional memory component add to expressibility, trainability and generalizability? In the case of expressibility, we can see that memory permits a machine to perform translation instead of just classification. In terms of trainability, we had to come up with additional mechanisms to learn how to update memory. Finally, for generalizability we need to use other kinds of benchmarks (i.e. BLEU, bAbI) to evaluate this kind of system. At every capability level, we need to re-explore how we achieve each of these three ilities.

Ideally we would like a framework in which one understands how to compose various building blocks, driven by an understanding of how each block contributes to trainability, expressibility or generalizability. Deep Learning is still very young, and we have few tools to evaluate the effectiveness of our solutions. Additionally, other ilities such as interpretability, transferability, latency, adversarial stability and security are worth exploring.

Five Capability Levels of Deep Learning Systems

Arend Hintze has a good short article, “Understanding the four types of AI, from reactive robots to self-aware beings”, where he describes the following four types:

Reactive Machines — The most basic type: unable to form memories or use past experiences to inform decisions. They can’t function outside the specific tasks they were designed for.

Limited Memory — These are able to look into the past to inform current decisions. The memory, however, is transient and isn’t used for future experiences.

Theory of Mind — These systems are able to form representations of the world as well as of the other agents they interact with.

Self-Awareness — Mostly a speculative description at this point.

I like this classification much better than the “Narrow AI” versus “General AI” dichotomy. It makes an attempt to break Narrow AI down into three categories, which gives us more concepts to differentiate AI implementations. My reservation, though, is that the definitions appear to come from a GOFAI mindset. Furthermore, the leap from limited memory that can employ the past to a theory of mind seems extremely vast.

I would, however, like to take this opportunity to come up with my own classification, more targeted toward the field of Deep Learning. I hope it is a bit more concrete and helpful for practitioners. This classification gives us a sense of where we currently are and where we might be heading.

We are inundated all the time with AI hype, to the point that we fail to form a good conceptual framework for making a precise assessment of the current situation. This may simply be because many writers have trouble keeping up with the latest developments in Deep Learning research. There is too much to read, and the latest discoveries continue to change our current understanding. See “Rethinking Generalization” for one of those surprising discoveries.

Here I introduce a pragmatic classification of Deep Learning capabilities:

1. Classification Only (C)

This level includes the fully connected neural network (FCN) and the convolution network (CNN), and various combinations of them. These systems take a high-dimensional vector as input and arrive at a single result, typically a classification of the input vector. You can consider these systems as stateless functions, meaning that their behavior is only a function of the current input. Generative models are one of the hotly researched areas, and they also belong to this category. In short, these systems are quite capable by themselves.

2. Classification with Memory (CM)

This level adds memory elements to the C level networks. LSTMs are an example of these, with the memory units embedded inside the LSTM node. Other variants are the Neural Turing Machine (NTM) and the Differentiable Neural Computer (DNC) from DeepMind. These systems maintain state as they compute their behavior.

3. Classification with Knowledge (CK)

This level is somewhat similar to the CM level; however, rather than raw memory, the information that the C level network is able to access is a symbolic knowledge base. There are actually three kinds of symbolic integration that I have found: a transfer learning approach, a top-down approach, and a bottom-up approach. The first uses a symbolic system that acts as a regularizer. The second has the symbolic elements at the top of the hierarchy, composed at the bottom from neural representations. The last has it reversed, with a C level network attached on top of a symbolic knowledge base.

4. Classification with Imperfect Knowledge (CIK)

At this level, we have a system built on top of CK that is able to reason with imperfect information. Examples of this kind of system are AlphaGo and poker-playing systems. AlphaGo, however, does not employ CK but rather CM level capability. Like AlphaGo, these kinds of systems can train themselves by running simulations against themselves.

5. Collaborative Classification with Imperfect Knowledge (CCIK)

This level is very similar to the “theory of mind”, where we have multiple agent neural networks combining to solve problems. These systems are designed to solve multiple objectives. We actually do see primitive versions of this in adversarial networks, which learn to perform generalization with competing discriminator and generator networks. Expand that concept further into game-theoretic driven networks that can act strategically and tactically while solving multiple objectives, and you have the makings of these kinds of extremely adaptive systems. We aren’t at this level yet, and there’s still plenty of research to be done at the previous levels.

Different levels bring about capabilities that don’t exist at the previous level. C level systems, for example, are only capable of predicting anti-causal relationships. CM level systems are capable of very good translation. CIK level systems are capable of strategic game play.

We can see how this classification somewhat aligns with Hintze’s classification, with the exception of course of self-awareness. That’s a capability I really have not explored and don’t intend to until the prerequisite capabilities have been addressed. I’ve also not addressed zero-shot or one-shot learning, or unsupervised learning. This is still one of the fundamental problems, as Yann LeCun has said:

If intelligence was a cake, unsupervised learning would be the cake, supervised learning would be the icing on the cake, and reinforcement learning would be the cherry on the cake. We know how to make the icing and the cherry, but we don’t know how to make the cake.

The accelerator technology in all of this, however, is when these capabilities are used in a feedback loop. We have actually seen instances of this kind of ‘meta-learning’ or ‘learning to optimize’ in current research. I cover these developments in another article, “Deep Learning can Now Design Itself!” The key takeaway with meta-methods is that our research methods become much more powerful when we can train machines to discover better solutions than we otherwise could find.

This is why, despite formidable problems in Deep Learning research, we can’t really be sure how rapid progress may proceed.

Rethinking Generalization in Deep Learning

The ICLR 2017 submission “Understanding Deep Learning Requires Rethinking Generalization” is certainly going to disrupt our understanding of Deep Learning. Here is a summary of what the authors discovered through experiments:

1. The effective capacity of neural networks is large enough for a brute-force memorization of the entire data set.

2. Even optimization on random labels remains easy. In fact, training time increases only by a small constant factor compared with training on the true labels.

3. Randomizing labels is solely a data transformation, leaving all other properties of the learning problem unchanged.
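
To make the first two findings concrete, here is a minimal sketch of my own of the randomized-label experiment; the synthetic data, the model size and the use of scikit-learn are my choices, not the paper’s. An over-parameterized network fits labels with real structure and labels with no structure at all almost equally easily.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))
y_true = (X[:, 0] + X[:, 1] > 0).astype(int)   # labels with real structure
y_rand = rng.integers(0, 2, size=500)          # labels with no structure at all

for name, y in [("true labels", y_true), ("random labels", y_rand)]:
    clf = MLPClassifier(hidden_layer_sizes=(256, 256), max_iter=2000)
    clf.fit(X, y)
    # With enough capacity, training accuracy approaches 1.0 in both cases.
    print(name, "training accuracy:", clf.score(X, y))
```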

The authors actually introduce two new definitions to express what they are observing. They talk about “explicit” and “implicit” regularization. Dropout, data augmentation, weight sharing and conventional regularization are all explicit regularization. Implicit regularization includes early stopping, batch norm, and SGD. It is an extremely odd definition that we’ll discuss.

I understand regularization (see: http://www.deeplearningpatterns.com/doku.php/regularization) as being of two types. I use the terms “Regularization by Construction” and “Regularization by Training”. Regularization by Training is the conventional use of the term. Regularization by Construction is a consequence of the model choices we make as we construct the elements of our network. The reason for the distinction, even though mathematically both appear as constraint terms, is that conventional regularization is not present after training, that is, in the inference path. Regularization by Construction is always present, in both the training and the inference stages.

Now, the paper’s distinction between explicit and implicit regularization hinges on whether the main intent of the method is to regularize. One does dropout to regularize, so it is explicit. One does batch normalization (BN) to normalize the activations across input samples, but it happens to also regularize, so it is implicit regularization. The distinction between the two is whether regularization is the purpose; the latter, implicit regularization, means that regularization is an unintended consequence of the technique. So when a researcher does not expect a method to lead to regularization and, to their surprise, it does, that is what they call ‘implicit’ regularization. I don’t think, however, that Hinton expected Dropout to lead to regularization. This is why I think the definition is extremely fuzzy, although I understand why they introduced the idea.
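
To make the distinction tangible, here is a minimal numpy sketch of my own of an explicit regularizer, inverted dropout: it appears as a concrete mask during training and is absent at inference. Implicit regularizers such as early stopping or the noise of SGD never show up as a term or mask like this; they fall out of how the model happens to be trained.

```python
import numpy as np

rng = np.random.default_rng(0)
h = rng.normal(size=(32, 128))          # a batch of hidden activations (made-up shape)
keep_prob = 0.8

mask = rng.random(h.shape) < keep_prob  # randomly zero out ~20% of the units
h_train = h * mask / keep_prob          # rescale so the expected activation is unchanged
h_infer = h                             # at inference time, dropout simply disappears
```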

The goal of regularization, however, is to improve generalization. That is also what BN does. In fact, for Inception architectures, BN is favored over dropout. Speaking of normalization, there are several kinds; batch and layer normalization are the two popular versions. The motivation for BN is supposed to be domain adaptation. Is domain adaptation different from generalization? Is it not just a specific kind of generalization? Are there other kinds of generalization? If so, what are they?

The authors have made the surprising discovery that a method that doesn’t seem to be about generalization, more specifically SGD, in fact generalizes. Another ICLR 2017 paper, “An Empirical Analysis of Deep Network Loss Surfaces”, adds confirmation of this SGD property. It shows empirically that the loss surfaces found by different SGD variants differ from each other. This tells you that what is happening is very different from traditional optimization.

It reminds one of quantum mechanics, where the probe affects the observation. Here the learning method affects what is learned. In this new perspective of neural networks, that of brute-force memorization or alternatively holographic machines, perhaps ideas from quantum mechanics may need to come into play. Quantum mechanics emerges from the non-commutativity of Poisson brackets in classical dynamics. We have two variables, position and momentum, that are inextricably tied together. In Deep Learning I have a hunch that there are more than two variables tied together that lead to regularization. We have at least three: the learning method, the network model and the generative model, all of which seem to have an effect on generalization. The troubling discovery, however, is how ineffective conventional regularization appears to be: “Explicit regularization may improve generalization performance, but is neither necessary nor by itself sufficient for controlling generalization error.”

I think right now we have a very blunt instrument when it comes to our definition of generalization. I wrote here that there are at least five different notions of generalization (http://www.deeplearningpatterns.com/doku.php/risk_minimization).

Definition 1: Error Response to Validation and Real Data

We can define it as the behavior of our system in response to validation data, that is, data that we have not included as part of the training set. We can be a bit more ambitious and define it as the behavior when the system is deployed to analyze real-world data. We essentially would like to see our trained system perform accurately on data it has never seen.

Definition 2: Sparsity of Model

A second definition is based on the idea of Occam’s Razor: the simplest explanation is likely the best explanation. Here we make certain assumptions about the form of the data, and we drive our regularization to constrain the solution toward those assumptions. For example, in the field of compressive sensing we assume that a sparse basis exists. From there we can derive an optimization problem that searches for solutions that have a sparse basis.
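
Here is a minimal sketch of this sparsity notion under my own assumptions (synthetic data, scikit-learn’s Lasso as the L1-penalized solver): only 3 of the 50 features actually matter, and the L1 penalty drives most of the learned weights to zero.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
true_w = np.zeros(50)
true_w[[3, 17, 42]] = [2.0, -1.5, 0.7]             # the sparse "true" basis
y = X @ true_w + rng.normal(scale=0.1, size=200)

model = Lasso(alpha=0.1).fit(X, y)
print("non-zero weights:", np.count_nonzero(model.coef_))   # close to 3
```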

Definition 3: Fidelity in Generating Models

A third definition is based on the system’s ability to recreate or reconstruct the features. This is the approach taken by generative models. If a neural network is able to accurately generate realistic images, then it has captured the concept of images in its entirety. We see this approach taken by researchers working on generative methods.

Definition 4: Effectiveness in Ignoring Nuisance Features

A fourth definition involves the notion of ignoring nuisance variables. That is, a system generalizes well if it is able to ignore the features that are irrelevant to its task: remove as many features as possible until you can’t remove any more. This is somewhat similar to the third definition; however, it tackles the problem from another perspective.

Definition 5: Risk Minimization

A fifth definition of generalization revolves around the idea of minimizing risk. When we train our system, there is uncertainty about the context in which it will be deployed, so we train our models with mechanisms that anticipate unpredictable situations. The hope is that the system is robust to contexts that have not been previously predicted. This is a kind of game-theoretic definition. We can envision an environment where information will always remain imperfect, and generalization effectively means executing a particular strategy within that environment. This may be the most abstract definition of generalization that we have.

I’m sure there are many more as we move to a more game theoretic framework of learning. In fact, one effective strategy for learning is driven by the notion of “Curiosity”.

11 Biases Why Experts will Miss the Deep Learning Revolution

I spend most of my waking time ( and likely my subconscious works overtime while I sleep ) studying Deep Learning.  Peter Thiel  has a phrase, “The Last Company Advantage”.  Basically you don’t necessarily need to have the “First Mover Advantage” however you absolutely want to be the last company standing in your kind of business.  So Google may be the last Search company, Amazon may be the last E-Commerce company and Facebook hopefully will not be the last Social Networking company.   What keeps me awake at night though is that Deep Learning could in fact be the “Last Invention of Man”!

However, let’s ratchet it down a little bit here. After all, Kurzweil’s Singularity (his estimate is 2045) is still three decades away. That’s still plenty of time for us humans to scheme on our little monopolies. Your objective for the next 30 years of humankind is to figure out whether you are going to be living in Elysium or in some unnamed decaying backwater:

Credit: Elysium the movie,  not the life-extension supplement.

Or, said differently, whether you take the blue pill or the red pill (or maybe I meant state).

To aid you in your decision making, here are 11 reasons that your “experts” are giving which will lead you to the unfortunate reality of missing the train:

Figure: Illustration by John Manoogian III. Cognitive biases can be organized into four categories: biases that arise from too much information, not enough meaning, the need to act quickly, and the limits of memory. From https://en.wikipedia.org/wiki/List_of_cognitive_biases

It’s just Machine Learning (It is a generalization of what I used to do. Well Traveled Road Effect Bias)

A practitioner’s introduction to neural networks is almost always via linear regression and then logistic regression. That’s because the mathematical equations for an artificial neural network (ANN) are essentially identical. So there immediately is a bias that the characteristics of these classical ML methods also carry over into the world of DL. After all, DL in its most naive explanation is nothing more than multiple layers of ANNs.

There are also other kinds of ML methods whose equations differ from DL’s. The basic objective of all ML methods, however, is a general notion of curve fitting: if you have a good fit of a model to the data, then perhaps that is a good solution. Unfortunately, with DL systems, because the number of parameters in the model is so large, these systems by default will over-fit any data. This alone is enough of a tell that a DL system is an entirely different kind of animal from an ML system.

It’s just Optimization (It is a simple case of what I usually do. Illusion of Control Bias )

DL systems have a loss function that measures how well their predictions match the training data. Classic optimization problems also have loss functions (also known as objective functions). In both cases, different kinds of heuristics are used to discover an optimal point in a large configuration space. It was once thought that the solution surface of a DL system was sufficiently complex that it would be impossible to arrive at a solution. Curiously enough, however, one of the simplest methods of optimization, the stochastic gradient descent algorithm, is all that is needed to arrive at surprising results.

What this tells you is that something else is going on here, something very different from what optimization folks are used to.

It’s a black box ( I can’t trust the unknown. Ambiguity Effect )

A lot of Data Scientists have an aversion to DL because of the lack of interpretability of its predictions. This is a characteristic not only of DL methods but of classical ML methods as well. Data Scientists would rather use probabilistic methods where they have better control of the models or priors, and as a result have systems that make predictions with the smallest number of parameters, all driven by the belief that parsimony, or Occam’s razor, is the optimal explanation for everything.

Unfortunately, probabilistic methods are not competitive in classifying images, speech or even text. That’s because DL methods are superior to human beings at discovering models. Brute force just happens to trump wetware. There’s no experimental evidence in the DL space that parsimonious models work any better than entangled models. For those cases where some kind of explanation is an absolute requirement, there are now newer methods in DL that aid interpretability as well as uncertainty estimation. If a DL system can generate the captions of an image, then there is a good chance it can be trained to generate an explanation of a prediction.

It’s too early and too soon ( I don’t trust anything that’s new. Illusion of Validity Bias )

This is a natural bias: something that is around five years old and rapidly evolving seems too new and volatile a technology to trust. I think we all said the same thing when the microprocessor, the internet, the web and mobile technologies came along. Wait and see was the safe approach for almost everyone. This is certainly a reasonable approach for anyone who has not really spent the time investigating the details. However, it is a very risky strategy; ignorance may be bliss, but another company eating your lunch can mean extinction.

There is too much hype. (Conservatism Bias)

There are a lot of things that DL can do that were deemed inconceivable just a couple of years ago. Nobody expected a computer to beat the best human player at Go. Nobody expected self-driving cars to exist today. Nobody expected to see Star Trek-style universal translator capabilities. It is so unbelievable that it seems more likely to be an exaggeration than something real. I hate, however, to burst your bubble of ignorance: DL is in fact very real, and you experience it yourself with every smartphone.

AI winter will likely come again. (Frequency Illusion Bias)

We’ve had so many times when the promise of AI led to disappointing results. The argument goes that because it has happened so often before, it is bound to happen again. The problem with this argument is that despite the disappointment, AI research has led to many software capabilities that we take for granted today and thus never notice. Good old-fashioned AI (GOFAI) is embedded in many systems today.

The current pace of DL development is accelerating, and there certainly are big problems that need to be solved. The need for a lot of training data and the lack of unsupervised training are two of them. This, however, doesn’t mean that what we have today has no value. DL can already drive cars; that in itself tells you that even if another AI winter arrives, we would have achieved a state of development that is still quite useful.

There’s not enough theory of how it works. (System Justification Bias)

The research community does not have a solid theoretical understanding of why DL works so effectively. We have some idea as to why a multi-layer neural network is more efficient at fitting functions than one with fewer layers. We don’t, however, have an understanding of why convergence even occurs or why good generalization happens. DL at this time is very experimental, and we are just learning to characterize these kinds of systems. Meanwhile, despite not having a good theoretical understanding, the engineering barrels forward. Researchers, using their intuition and educated guesses, are able to build exceedingly better models. In other words, nobody is stopping their work to wait for a better theory. It is almost analogous to what happens in biotechnology research: people are experimenting with many different combinations and arriving at new discoveries that they have yet to explain. Scientific and technological progress is very messy, and one shouldn’t shy away from the benefits because of the chaos.

It is not biologically inspired. (Anthropomorphism Bias)

DL systems are very unlike the neurons in our brain. The mechanism by which DL learns (i.e. SGD) is not something we can explain as happening in our brain. The argument here is that if it doesn’t resemble the brain, then it is unlikely to be able to perform the kind of inference and learning a brain does. This of course is an extremely weak argument. After all, planes don’t look like birds, but they certainly can fly.

I’m not an expert in it. (Not Invented Here Bias)

Not having expertise in-house shouldn’t be an excuse for avoiding finding expertise outside. Furthermore, it shouldn’t prevent you from having your experts learn this new technology. However, if those experts are of the dogmatic persuasion, then that should be a tell for you to get a second, unbiased opinion.

It does not apply to my problems (Ostrich Effect)

Businesses are composed of many business processes. Unless you have gone through the exercise of examining which processes can be automated with current DL technologies, you are not in a position to state that DL does not apply to you. Furthermore, you may discover new processes and business opportunities that may not exist today but are possible through the exploitation of DL technology. You cannot really answer this question until you have invested in some due diligence.

I don’t have the resources (Status Quo Bias)

The large internet companies like Google and Facebook have gobbled up a lot of the Deep Learning talent out there. These companies have very little interest in working with a small business to identify its specific needs and opportunities. Fortunately, however, these big companies have been gracious enough to allow their researchers to publish their work. We therefore have a view into their latest developments and are able to take what they’ve learned and apply it to your context. There are companies like Intuition Machine that have an onboarding process to give you a competitive head start in DL technologies.

Please reach out to Intuition Machine to learn how to use Deep Learning in your business.

Meta-Learning in Deep Learning is now Reality

Note:  This is a short version of “Deep Learning – The Unreasonable Effectiveness of Randomness”.

The deadline for paper submissions for ICLR 2017 in Toulon, France has arrived, and instead of a trickle of new knowledge about Deep Learning we get a massive deluge. This is a gold mine of research that’s hot off the presses. Many papers are incremental improvements on state-of-the-art algorithms. I had hoped to find more fundamental theoretical and experimental results on the nature of Deep Learning; unfortunately there were just a few. There were, however, two developments that were mind-boggling.

The mind-boggling discovery is that you can train a neural network to learn to learn (i.e. meta-learning). More specifically, several research groups have trained neural networks to perform stochastic gradient descent (SGD). Not only have they been able to demonstrate neural networks that have learned SGD, but the networks have performed better than any hand-tuned human method! The two papers that were submitted were “Deep Reinforcement Learning for Accelerating the Convergence Rate” and “Optimization as a Model for Few-Shot Learning”. Unfortunately, these two groups had been previously scooped by DeepMind, who showed that you could do this in the paper “Learning to Learn by Gradient Descent by Gradient Descent”. The two latter papers trained an LSTM, while the first one trained via RL. I had thought that it would take a bit longer to see meta-learning implemented, but it has arrived much sooner than I had expected!

Not to be outdone, two other research groups created machines that design new Deep Learning networks, and do it in such a way as to improve over the state-of-the-art! These are machines that learn how to design neural networks. The two papers that were submitted are “Designing Neural Network Architectures using Reinforcement Learning” and “Neural Architecture Search with Reinforcement Learning”. The former paper describes the use of reinforcement Q-learning to discover CNN architectures (you can find some of their generated CNNs in Caffe here: https://bowenbaker.github.io/metaqnn/ ). The latter paper is truly astounding (you can’t do this without Google’s compute resources): not only did the researchers show the generation of state-of-the-art CNN networks, the machine actually learned a few more variants of the LSTM node! Here are the LSTM nodes the machine created (left and bottom):

So not only are researchers who hand-optimize gradient descent solutions out of business, but so are folks who make a living designing neural architectures! This is actually just the beginning of Deep Learning systems bootstrapping themselves. So I must now share Schmidhuber’s cartoon that aptly describes what is happening:

This is absolutely shocking, and there’s really no end in sight as to how quickly Deep Learning algorithms are going to improve. This meta capability allows you to apply it to itself, recursively creating better and better systems.

Deep Learning – The Unreasonable Effectiveness of Randomness

The deadline for paper submissions for ICLR 2017 in Toulon, France has arrived, and instead of a trickle of new knowledge about Deep Learning we get a massive deluge. This is a gold mine of research that’s hot off the presses. Many papers are incremental improvements on state-of-the-art algorithms. I had hoped to find more fundamental theoretical and experimental results on the nature of Deep Learning; unfortunately there were just a few. There were, however, two developments that were mind-boggling, and one paper confirming, with shocking results, something I’ve been suspecting for a while now. It really is a good news, bad news story.

First let’s talk about the good news. The first mind-boggling discovery is that you can train a neural network to learn to learn (i.e. meta-learning). More specifically, several research groups have trained neural networks to perform stochastic gradient descent (SGD). Not only have they been able to demonstrate neural networks that have learned SGD, but the networks have performed better than any hand-tuned human method! The two papers that were submitted were “Deep Reinforcement Learning for Accelerating the Convergence Rate” and “Optimization as a Model for Few-Shot Learning”. Unfortunately, these two groups had been previously scooped by DeepMind, who showed that you could do this in the paper “Learning to Learn by Gradient Descent by Gradient Descent”. The two latter papers trained an LSTM, while the first one trained via RL. I had thought that it would take a bit longer to see meta-learning implemented, but it has arrived much sooner than I had expected!

Not to be outdone, two other groups created machines that can design new Deep Learning networks, and do it in such a way as to improve on the state-of-the-art! This is learning to design neural networks. The two papers that were submitted are “Designing Neural Network Architectures using Reinforcement Learning” and “Neural Architecture Search with Reinforcement Learning”. The former paper describes the use of reinforcement Q-learning to discover CNN architectures. You can find some of their generated CNNs in Caffe here: https://bowenbaker.github.io/metaqnn/ . The latter paper is truly astounding (you can’t do this without Google’s compute resources): not only did they show state-of-the-art CNN networks, the machine actually learned a few more variants of the LSTM node! Here are the LSTM nodes the machine created (left and bottom):

So not only are researchers who hand-optimize gradient descent solutions out of business, but so are folks who make a living designing neural architectures! This is actually just the beginning of Deep Learning systems bootstrapping themselves. So I must now share Schmidhuber’s cartoon that aptly describes what is happening:

This is absolutely shocking, and there’s really no end in sight as to how quickly Deep Learning algorithms are going to improve. This meta capability allows you to apply it to itself, recursively creating better and better systems.

Permit me now to deal you the bad news. Here is the paper that is the bearer of that news: “Understanding Deep Learning Requires Rethinking Generalization”. I’ve thought about generalization a lot, and I’ve posted some queries on Quora about generalization and also about randomness in the hope that someone could give some good insight. Unfortunately, nobody had enough of an answer or understood the significance of the question until the folks who wrote the above paper performed some interesting experiments. Here is a snippet of what they found:

1. The effective capacity of neural networks is large enough for a brute-force memorization of the entire data set.

2. Even optimization on random labels remains easy. In fact, training time increases only by a small constant factor compared with training on the true labels.

3. Randomizing labels is solely a data transformation, leaving all other properties of the learning problem unchanged.

The shocking truth revealed: Deep Learning networks are just massive associative memory stores! Those high-dimensional manifolds that are supposedly created appear to be figments of our imagination! That’s why adversarial features work: there aren’t any continuously navigable manifolds! It is kind of like when we discovered the nucleus and the electron and realized that there is mostly empty space between them. Atoms aren’t hard spheres!

John Hopcroft wrote a paper earlier this year examining the duality of neural networks and associative memory. Here’s a figure from his paper:

The “Rethinking Generalization” paper goes even further by examining our tried and true tool for achieving Generalization (i.e. Regularization) and finds that:

Explicit regularization may improve generalization performance, but is neither necessary nor by itself sufficient for controlling generalization error.

In other words, all our regularization tools may be less effective than we believe! Furthermore, and even more shocking, the unreasonable effectiveness of SGD turns out to be:

Appealing to linear models, we analyze how SGD acts as an implicit regularizer.

just a different kind of regularization that happens to work!

In fact, a paper submitted for ICLR 2017 by another group, titled “An Empirical Analysis of Deep Network Loss Surfaces”, confirms that the local minima of these networks are different:

Our experiments show that different optimization methods find different minima, even when switching from one method to another very late during training. In addition, we found that the minima found by different optimization methods have different shapes, but that these minima are similar in terms of the most important measure, generalization accuracy.

Which tells you that your choice of learning algorithm “rigs” how it arrives at a solution. Randomness is ubiquitous, and it does not matter how you regularize your network or which SGD variant you employ; the network just seems to evolve (if you set the right random conditions) towards convergence! What are the properties of SGD that lead to machines that can learn? Are these properties tied to differentiation, or is it something more general? If we can teach a network to perform SGD, can we teach it to perform this unknown, more general learning method?

The effectiveness of this randomness was in fact demonstrated earlier this year in the paper “A Powerful Generative Model Using Random Weights for the Deep Image Representation”, also co-authored by John Hopcroft, which showed that you could generate realistic imagery using randomly initialized networks without any training! How could this be possible?

Therefore, to understand Deep Learning, we must embrace randomness. Randomness arises from maximum entropy, which interestingly enough is not without its own structure! The strangeness here is that randomness is ubiquitous in the universe. The arrow of time points in the direction of greater entropy. How then is it that this property is also the basis of learning machines?

Please see Design Patterns for Deep Learning: Canonical Patterns for additional insight on this intriguing subject.

Why Deep Learning is Radically Different from Machine Learning

There is a lot of confusion these days about Artificial Intelligence (AI), Machine Learning (ML) and Deep Learning (DL). There certainly is a massive uptick of articles about AI being a competitive game changer and how enterprises should begin to seriously explore the opportunities. The distinctions between AI, ML and DL are very clear to practitioners in these fields. AI is the all-encompassing umbrella that covers everything from Good Old Fashioned AI (GOFAI) all the way to connectionist architectures like Deep Learning. ML is a sub-field of AI that covers anything that has to do with the study of learning algorithms trained with data. Whole swaths (not swatches) of techniques have been developed over the years: Linear Regression, K-means, Decision Trees, Random Forests, PCA, SVMs and finally Artificial Neural Networks (ANNs). Artificial Neural Networks are where the field of Deep Learning had its genesis.

Some ML practitioners who have had previous exposure to neural networks (ANNs were, after all, invented in the early 60’s) may have the first impression that Deep Learning is nothing more than an ANN with multiple layers, and that the success of DL is mostly due to the availability of more data and of more powerful computational engines like Graphics Processing Units (GPUs). This of course is true; the emergence of DL is essentially due to these two advances. However, the conclusion that DL is just a better algorithm than SVMs or Decision Trees is akin to focusing only on the trees and not seeing the forest.

To paraphrase Andreessen, who said “Software is eating the world”: “Deep Learning is eating ML”. Two publications by practitioners in different machine learning fields summarize best why DL is taking over the world. Chris Manning, an expert in NLP, writes about the “Deep Learning Tsunami”:

Deep Learning waves have lapped at the shores of computational linguistics for several years now, but 2015 seems like the year when the full force of the tsunami hit the major Natural Language Processing (NLP) conferences. However, some pundits are predicting that the final damage will be even worse.

Nicholas Paragios  writes about the “Computer Vision Research: the Deep Depression“:

It might be simply because deep learning on highly complex, hugely determined in terms of degrees of freedom graphs once endowed with massive amount of annotated data and unthinkable – until very recently – computing power can solve all computer vision problems. If this is the case, well it is simply a matter of time that industry (which seems to be already the case) takes over, research in computer vision becomes a marginal academic objective and the field follows the path of computer graphics (in terms of activity and volume of academic research).

These two publications highlight how the field of Deep Learning is fundamentally changing current practices.

The current DL hype tends to be that we have this commoditized machinery that, given enough data and enough training time, is able to learn on its own. This is either an exaggeration of what the state-of-the-art is capable of or an oversimplification of the actual practice of DL. Over the past few years DL has given rise to a massive collection of ideas and techniques that were previously either unknown or known to be untenable. At first this collection of concepts seems fragmented and disparate. Over time, however, patterns and methodologies begin to emerge, and we are frantically attempting to cover this space in “Design Patterns of Deep Learning”.

Figure: Periodic Table

Deep Learning today goes beyond just multi-layer perceptrons; it is a collection of techniques and methods for building composable differentiable architectures. These are extremely capable machine learning systems, and we are only now seeing the tip of the iceberg. The key takeaway is that Deep Learning may look like alchemy today, but we will eventually learn to practice it like chemistry. That is, we will have a more solid foundation, so as to be able to build our learning machines with greater predictability of their capabilities.

 

The Unreasonable Simplicity of Universal Machines

The Rule 110 cellular automaton, or more specifically the one-dimensional cellular automaton (you can explore these here: http://atlas.wolfram.com/01/01/) with the following rule:

current pattern:           111 110 101 100 011 010 001 000
new state for center cell:   0   1   1   0   1   1   1   0

is all the complexity that one needs to create a machine with all the computational capability of a Turing Machine, and hence of any computer system.
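
Here is a minimal sketch of my own for simulating an elementary cellular automaton such as Rule 110; the width, number of steps, periodic boundary and initial state are arbitrary choices.

```python
import numpy as np

def step(cells, rule=110):
    """One update of an elementary CA; the rule number encodes the 8-entry lookup table."""
    table = [(rule >> i) & 1 for i in range(8)]          # table[pattern] = new center state
    left, right = np.roll(cells, 1), np.roll(cells, -1)  # periodic boundary condition
    pattern = 4 * left + 2 * cells + right               # e.g. (1, 1, 0) -> 6
    return np.array([table[p] for p in pattern])

cells = np.zeros(64, dtype=int)
cells[-1] = 1                                            # a single "on" cell
for _ in range(32):
    print("".join("#" if c else "." for c in cells))
    cells = step(cells)
```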

NAND gates (or alternatively NOR gates):

A  B | A NAND B
0  0 | 1
0  1 | 1
1  0 | 1
1  1 | 0

is all the logic one needs to compose any boolean equation.
How does the Rule 110 automaton differ from a NAND gate? The NAND gate has 4 rules; the automaton, however, has 8 rules. If we look closely, we see that the Rule 110 automaton contains all the rules of the NAND gate. Specifically, 010 -> 1, 011 -> 1, 110 -> 1 and 111 -> 0. In other words, if the center cell is set to 1, then Rule 110 acts just like a NAND gate on its two neighbors. However, there are 14 other cellular automata that capture the NAND logic but are not universal.
The center-cell state of 0 for the Rule 110 automaton apparently has some additional capability that leads to universal behavior. Let's examine it. When the center cell is 0, the behavior becomes:

101 -> 1
100 -> 0
001 -> 1
000 -> 0

or, if we ignore the center cell:

11 -> 1
10 -> 0
01 -> 1
00 -> 0

The middle two rules appear to break symmetry, in that there is a clear distinction as to which neighbor cell is on.
Let's examine another automaton, Rule 30, which is known to be chaotic:

current pattern:           111 110 101 100 011 010 001 000
new state for center cell:   0   0   0   1   1   1   1   0
When the center cell is 0:

101 -> 0
100 -> 1
001 -> 1
000 -> 0

which is an XOR,
and when the center cell is 1:

111 -> 0
110 -> 0
011 -> 1
010 -> 1

with that symmetry breaking that we see in Rule 110.
The complement of Rule 110 is Rule 137:

current pattern:           111 110 101 100 011 010 001 000
new state for center cell:   1   0   0   0   1   0   0   1

This is the same construction as Rule 110, but with a universal NOR gate in place of the NAND gate. When the center cell is 1, the behavior is:

111 -> 1
110 -> 0
011 -> 1
010 -> 0

which is the same shift behavior as in Rule 110, but with the center state now 1 instead of 0.

If we instead take Rule 137 and replace the rules for 111 and 010 with 111 -> 0 and 010 -> 1, we have

current pattern:           111 110 101 100 011 010 001 000
new state for center cell:   0   0   0   0   1   1   0   1

which is Rule 13 and not universal.

So it's not just the symmetry breaking that's important; the rules 11 -> 1 and 00 -> 0 matter as well. Note: flipping the rules for 10 and 01 also gives a universal rule.

So, what perhaps is the significance of this circuitry…

101 -> 1
100 -> 0
001 -> 1
000 -> 0
that leads to universality?
What we see in these rules is that the value of the right neighbor cell becomes the new center cell. The mirror rule, Rule 124, shifts from the left instead, and the complement rule, Rule 137, also shifts from the right. So to achieve a universal machine one just needs two operations: a NAND (or NOR) operator and a shift operator. The center cell determines which operator is active at the time.
Now that we have found the simplest machine possible, can we now attempt to identify the simplest machine that can learn? If we are able to do this, we can then show that a majority of systems in nature are in fact learning machines!

10 Lessons Learned from Building Deep Learning Systems

Deep Learning is a sub-field of Machine Learning that has its own peculiar ways of doing things. Here are 10 lessons that we've uncovered while building Deep Learning systems. These lessons are a bit general, although they focus on applying Deep Learning in an area that involves structured and unstructured data.

The More Experts the Better

The one tried and true way to improve accuracy is to have more networks perform the inference and to combine the results. In fact, techniques like Dropout are a means of creating “implicit ensembles” where multiple subsets of superimposed networks cooperate using shared weights.

Seek Problems where Labeled Data is Abundant

The current state of Deep Learning is that it works well only in a supervised context. The rule of thumb is around 1,000 samples per rule. So if you are given a problem where you don't have enough data to train with, consider an intermediate problem that does have more data, and then run a simpler algorithm on the results of the intermediate problem.

Search for ways to Synthesize Data

Not all data is nicely curated and labeled for machine learning. Many times you have data that is weakly tagged. If you can join data from disparate sources to achieve a weakly labeled set, then this approach works surprisingly well. The best known example is Word2Vec, where you train for word understanding based on the words that happen to appear in proximity to other words.

Leverage Pre-trained Networks

One of the spectacular capabilities of Deep Learning networks is that bootstrapping from an existing pre-trained network and using it to train for a new domain works surprisingly well.
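
As a minimal sketch (assuming PyTorch and torchvision, which this lesson does not name), bootstrapping from a pre-trained network can be as simple as freezing its weights and training a new head for a hypothetical 5-class problem.

```python
import torch.nn as nn
from torchvision import models

model = models.resnet18(pretrained=True)        # bootstrapped from ImageNet weights
for param in model.parameters():
    param.requires_grad = False                 # keep the pre-trained features fixed

model.fc = nn.Linear(model.fc.in_features, 5)   # new trainable head for the new domain
# Train as usual, optimizing only model.fc.parameters().
```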

Don’t forget to Augment Data

Data usually has meaning that a human may be aware of but that a machine may never discover. One simple example is a time feature. From the perspective of a human, the day of the week or the time of day may be important attributes; however, a Deep Learning system may never be able to surface that if all it is given is seconds since the Unix epoch.
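
A minimal sketch of that time-feature example, with an arbitrary timestamp: derive the human-meaningful attributes explicitly rather than hoping the network discovers them from raw seconds.

```python
from datetime import datetime, timezone

ts = 1480000000                                  # an arbitrary Unix timestamp
dt = datetime.fromtimestamp(ts, tz=timezone.utc)

features = {
    "day_of_week": dt.weekday(),                 # 0 = Monday ... 6 = Sunday
    "hour_of_day": dt.hour,
    "is_weekend": int(dt.weekday() >= 5),
}
print(features)
```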

Explore Different Regularizations

L1 and L2 regularization are not the only regularizations out there. Explore the different kinds, and perhaps look at different regularizations per layer.

Random Initialization works Surprisingly Well

There are multiple techniques for initializing your network prior to training. In fact, you can get very far just training the last layer of a network, with the previous layers left mostly random. Consider using this technique to speed up your hyper-parameter tuning explorations.
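
A minimal sketch of this idea under my own assumptions (synthetic data, scikit-learn for the final layer): a fixed random projection stands in for the untrained earlier layers, and only a linear classifier on top is trained.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))
y = (X[:, 0] * X[:, 1] > 0).astype(int)          # a nonlinear toy labeling

W = rng.normal(size=(20, 512))                   # random, untrained "hidden layer"
H = np.tanh(X @ W)                               # random features

clf = LogisticRegression(max_iter=1000).fit(H, y)   # train only the last layer
print("training accuracy:", clf.score(H, y))
```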

End-to-End Deep Learning is a Hail Mary Play

A lot of researchers love to explore end-to-end deep learning research. Unfortunately, the most effective use of Deep Learning has been to couple it with other techniques. AlphaGo would not have been successful if Monte Carlo Tree Search had not been employed.

Resist the Urge to Distribute

If you can, try to avoid using multiple machines (with the exception of hyper-parameter tuning).  Training on a single machine is the most cost effective way to proceed.

Convolution Networks work pretty well even beyond Images

Convolution networks are clearly the most successful kind of network in the Deep Learning space. However, ConvNets are not only for images; you can use them for other kinds of features (i.e. voice, time series, text).
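
A minimal sketch of the same sliding-window operation a ConvNet layer applies, here over a made-up one-dimensional time series rather than an image.

```python
import numpy as np

rng = np.random.default_rng(0)
signal = np.sin(np.linspace(0, 6 * np.pi, 100)) + 0.1 * rng.normal(size=100)
kernel = np.ones(5) / 5.0          # a learned filter would replace this fixed moving average

features = np.convolve(signal, kernel, mode="valid")   # slide the filter along the series
print(features.shape)                                  # (96,)
```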

That’s all I have for now. There are certainly a lot more lessons. Let me know if you stumble on others.

You can find more details of these individual lessons at http://www.deeplearningpatterns.com

A Development Methodology for Deep Learning

The practice of software development has created development methodologies, such as agile development and the lean methodology, to tackle the complexity of development with the objective of improving the quality and efficiency of software creation. Although Deep Learning is built from software, it is a different kind of software, and therefore a different kind of methodology is needed. Deep Learning differs most from traditional software development in that a substantial portion of the process involves the machine learning how to achieve objectives. The developer is not completely out of the equation, but works in concert with the machine to tweak the Deep Learning algorithm.

Deep Learning is a sufficiently rich and complex subject that a process model or methodology is required to guide a developer. The methodology addresses the necessary interplay between the need for more training data and the exploration of alternative Deep Learning patterns that drive the discovery of an effective architecture. The methodology is depicted as follows:

We begin first with some initial definition of the kind of architecture we wish to train. This will of course be driven by the nature of the data we are training from and the kind of prediction we seek. The latter is guided by Explanatory Patterns and the former by Feature Patterns. There are a variety of ways to optimize our training process; this is guided by the Learning Patterns.

After the selection of our network model and the data we plan to train on, the developer is then tasked with answering the question of whether an adequate labeled training set is available. This process goes beyond the conventional machine learning process that divides the dataset into three sets: the machine learning convention has been to create a training set, a validation set and a test set. In the first step of the process, if the training error remains high, there are several options that can be pursued. The first is to try to increase the size of the model; a second option is to train a bit longer (or alternatively perform hyper-parameter tuning); and if all fails, the developer tweaks the architecture or attempts a new architecture. In the second step of the process, the developer validates the training against the validation set; if the error rate is high, indicating overfitting, then the options are to find more data, apply different regularizations and, if all fails, attempt another architecture. The observation here that differs from conventional machine learning is that Deep Learning has more flexibility, in that a developer has the additional options of employing either a bigger model or more data. One of the hallmarks of deep learning is its scalability in performing well when trained with large data sets.

Trying a larger model is something that a developer has control over; unfortunately, finding more data poses a more difficult problem. To satisfy this need for more data, one can leverage data from different contexts. In addition, one can employ data synthesis and data augmentation to increase the size of the training data. These approaches, however, lead to domain adaptation issues, so a slight change in the traditional machine learning development model is called for. In this extended approach, the validation and test sets are required to belong to the same context as the deployment environment. Furthermore, to validate training on this heterogeneous data, another set, called the training-validation set, is set aside to act as additional validation. This basic process model, inspired by a talk by Andrew Ng, serves as a good scaffolding on which to hang the many different patterns that we find in Deep Learning.
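
Here is a minimal sketch of how I read the extended split described above, following Andrew Ng's train / training-validation / validation / test scheme; the data and sizes are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
X_source = rng.normal(size=(10000, 32))            # abundant data from other contexts (synthesized, augmented)
X_target = rng.normal(loc=0.5, size=(2000, 32))    # scarcer data from the deployment context

# Split the source data: most for training, a held-out slice as the training-validation set.
idx = rng.permutation(len(X_source))
X_train, X_train_val = X_source[idx[:9000]], X_source[idx[9000:]]

# Split the target data into validation and test sets (same context as deployment).
jdx = rng.permutation(len(X_target))
X_val, X_test = X_target[jdx[:1000]], X_target[jdx[1000:]]

# A large gap between train and training-validation error suggests overfitting;
# a large gap between training-validation and validation error suggests data mismatch.
```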

As you can see, there are many paths of exploration and many alternative models that may be explored to reach a solution. Furthermore, there is sufficient modularity in Deep Learning that we may compose solutions from other, previously developed solutions. Autoencoders, neural embeddings, transfer learning and bootstrapping with pre-trained networks are some of the tools that provide potential shortcuts to reduce the need to train from scratch.

This is by no means complete, but it is definitely a good starting point to guide the development of this entirely new kind of architecture.