Month: October 2016

The Unreasonable Simplicity of Universal Machines


The Rule 110 cellular automaton, a one-dimensional cellular automaton (you can explore these at http://atlas.wolfram.com/01/01/ ) with the following rule:

current pattern            111  110  101  100  011  010  001  000
new state for center cell   0    1    1    0    1    1    1    0

is all the complexity one needs to create a machine with the full computational capability of a Turing Machine, and hence of any computer system.
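As a concrete illustration, the rule table above can be executed directly. Here is a minimal sketch in Python (the function names are mine, and zero-padded boundaries are an assumption, not part of the rule):

```python
# Rule 110 lookup: (left, center, right) -> new center state,
# transcribed directly from the table above.
RULE_110 = {
    (1, 1, 1): 0, (1, 1, 0): 1, (1, 0, 1): 1, (1, 0, 0): 0,
    (0, 1, 1): 1, (0, 1, 0): 1, (0, 0, 1): 1, (0, 0, 0): 0,
}

def step(cells):
    """Apply Rule 110 once to a list of 0/1 cells (boundaries treated as 0)."""
    padded = [0] + cells + [0]
    return [RULE_110[(padded[i - 1], padded[i], padded[i + 1])]
            for i in range(1, len(padded) - 1)]

# A single 1 grows the characteristic left-leaning structures of Rule 110.
row = [0, 0, 0, 0, 1]
for _ in range(4):
    print("".join("#" if c else "." for c in row))
    row = step(row)
```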

NAND gates (or alternatively NOR gates):

INPUT     OUTPUT
A   B     A NAND B
0   0     1
0   1     1
1   0     1
1   1     0
is all the logic one needs to compose any boolean equation.
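To make that concrete, here is a hedged sketch composing the other basic gates from NAND alone (the helper names are illustrative):

```python
# NAND is functionally complete: every basic gate can be built from it.
def nand(a, b):
    return 0 if (a and b) else 1

def not_(a):
    return nand(a, a)          # NOT x  = x NAND x

def and_(a, b):
    return not_(nand(a, b))    # AND    = NOT(NAND)

def or_(a, b):
    return nand(not_(a), not_(b))  # OR = NAND of the negations

def xor(a, b):
    t = nand(a, b)             # classic 4-NAND XOR construction
    return nand(nand(a, t), nand(b, t))

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", and_(a, b), or_(a, b), xor(a, b))
```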
How does the Rule 110 automaton differ from a NAND gate?  The NAND gate has 4 rules; the automaton has 8.  If we look closely, we see that Rule 110 contains all the rules of the NAND gate.  Specifically, 010 -> 1, 011 -> 1, 110 -> 1 and 111 -> 0.  In other words, if the center cell is set to 1, then Rule 110 acts on its two neighbors just like a NAND gate.  However, there are 14 other cellular automata that capture the NAND logic but are not universal.
The center-cell state of 0 in Rule 110 apparently carries some additional capability that leads to universal behavior.  Let's examine it: when the center cell is 0, the behavior becomes:
1 0 1 -> 1
1 0 0 -> 0
0 0 1 -> 1
0 0 0 -> 0
or if we ignore the center cell:
1  1  -> 1
1 0  -> 0
0 1  -> 1
0 0  -> 0
The middle two rules break symmetry: there is a clear distinction as to which neighbor cell is on.
Let’s examine another automaton, Rule 30, which is known to be chaotic:
current pattern            111  110  101  100  011  010  001  000
new state for center cell   0    0    0    1    1    1    1    0
When the center cell is 0:
101 -> 0
100 -> 1
001 -> 1
000 -> 0
which is an XOR of the neighbors,
and when the center cell is 1:
111 -> 0
110 -> 0
011 -> 1
010 -> 1
with the same symmetry breaking that we see in Rule 110.
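This center-cell slicing can be done mechanically for any elementary rule. Here is a sketch (the helper names are my own) that recovers the two Rule 30 sub-tables above:

```python
def wolfram_table(rule):
    """Map (left, center, right) -> new center state for a Wolfram rule number.

    Pattern p encodes (left, center, right) as a 3-bit number; the new state
    is bit p of the rule number.
    """
    return {((p >> 2) & 1, (p >> 1) & 1, p & 1): (rule >> p) & 1
            for p in range(8)}

def slice_by_center(rule, center):
    """Fix the center cell and return the induced (left, right) truth table."""
    t = wolfram_table(rule)
    return {(l, r): t[(l, center, r)] for l in (0, 1) for r in (0, 1)}

print(slice_by_center(30, 0))  # XOR of the neighbors
print(slice_by_center(30, 1))  # NOT of the left neighbor
```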
The complement of Rule 110 is Rule 137:
current pattern            111  110  101  100  011  010  001  000
new state for center cell   1    0    0    0    1    0    0    1

which is the same as Rule 110, except the universal NAND gate is replaced by a universal NOR gate (active when the center cell is 0).  When the center cell is 1:

111 -> 1
110 -> 0
011 -> 1
010 -> 0

which is the same shift behavior as Rule 110, but with the center state now 1 instead of 0.

If, starting from Rule 137, we replace the rules for 111 and 010 with 111 -> 0 and 010 -> 1, we have

current pattern            111  110  101  100  011  010  001  000
new state for center cell   0    0    0    0    1    1    0    1

which is Rule 13 and is not universal.

So it's not just the symmetry breaking that's important; the shift rules 11 -> 1 and 00 -> 0 matter too.  Note: flipping the rules for 10 and 01 (giving a shift from the left, the mirror Rule 124) is also universal.

So what, perhaps, is the significance of this circuitry

1 0 1 -> 1
1 0 0 -> 0
0 0 1 -> 1
0 0 0 -> 0
that leads to universality?
What we see with these rules is that the value of the right neighbor cell becomes the new center cell.  The mirror rule, Rule 124, shifts from the left instead, and the complement rule, Rule 137, also shifts from the right.  So to achieve a universal machine one needs just two operators: a NAND (or NOR) operator and a shift operator, with the center cell determining which operator is active at the time.
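This decomposition of Rule 110 (NAND when the center is 1, copy-the-right-neighbor when it is 0) is easy to verify in code; a sketch with illustrative names:

```python
def rule110(left, center, right):
    """New center state under Rule 110: bit 'pattern' of the number 110."""
    pattern = (left << 2) | (center << 1) | right
    return (110 >> pattern) & 1

for l in (0, 1):
    for r in (0, 1):
        # Center = 1: behaves as NAND of the two neighbors.
        assert rule110(l, 1, r) == (0 if (l and r) else 1)
        # Center = 0: copies the right neighbor (a shift).
        assert rule110(l, 0, r) == r

print("Rule 110 = NAND (center 1) + shift from the right (center 0)")
```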
Now that we have found the simplest machine possible, can we now attempt to identify the simplest machine that can learn? If we are able to do this, we can then show that a majority of systems in nature are in fact learning machines!
10 Lessons Learned from Building Deep Learning Systems


Deep Learning is a sub-field of Machine Learning that has its own peculiar ways of doing things.  Here are 10 lessons that we’ve uncovered while building Deep Learning systems.  These lessons are a bit general, although they do focus on applying Deep Learning in an area that involves structured and unstructured data.

The More Experts the Better

The one tried and true way to improve accuracy is to have more networks perform the inference and to combine their results.  In fact, techniques like Dropout are a means of creating “implicit ensembles” where multiple subsets of superimposed networks cooperate using shared weights.
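A minimal sketch of the explicit form of this idea: average the class-probability outputs of several independently trained networks and take the argmax. The probability vectors below are random stand-ins for real model outputs:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for per-network class probabilities on one input,
# e.g. the softmax outputs of three independently trained nets.
predictions = [rng.dirichlet(np.ones(4)) for _ in range(3)]

# Ensemble by averaging the probability vectors, then taking the argmax.
avg = np.mean(predictions, axis=0)
print("ensemble class:", int(np.argmax(avg)))
```

Averaging probabilities (rather than hard votes) keeps each network's confidence information, which is usually why ensembles of this kind beat any single member.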

Seek Problems where Labeled Data is Abundant

The current state of Deep Learning is that it works well only in a supervised context.  The rule of thumb is around 1,000 samples per class.  So if you are given a problem where you don’t have enough data to train with, consider an intermediate problem that does have more data, and then run a simpler algorithm on the results from the intermediate problem.

Search for ways to Synthesize Data

Not all data is nicely curated and labeled for machine learning.  Many times you have data that is only weakly tagged.  If you can join data from disparate sources to achieve a weakly labeled set, this approach works surprisingly well.  The best-known example is Word2Vec, where you train for word understanding based on the words that happen to appear in proximity to one another.

Leverage Pre-trained Networks

One of the spectacular capabilities of Deep Learning networks is that bootstrapping from an existing pre-trained network and using it to train into a new domain works surprisingly well.

Don’t forget to Augment Data

Data usually has meaning that a human may be aware of but that a machine will likely never discover.  One simple example is a time feature.  From the perspective of a human, the day of the week or the time of day may be important attributes; however, a Deep Learning system may never be able to surface that if all it is given is seconds since the Unix epoch.
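A sketch of this kind of augmentation, expanding a raw epoch timestamp into calendar features (the feature names and the cyclic encoding are my own choices, not prescribed by any framework):

```python
from datetime import datetime, timezone
import math

def time_features(epoch_seconds):
    """Expand a Unix timestamp into calendar features a network can use."""
    t = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    # Cyclic encoding so that 23:00 and 01:00 end up numerically close.
    hour_angle = 2 * math.pi * t.hour / 24
    return {
        "day_of_week": t.weekday(),        # 0 = Monday ... 6 = Sunday
        "hour": t.hour,
        "hour_sin": math.sin(hour_angle),
        "hour_cos": math.cos(hour_angle),
        "is_weekend": int(t.weekday() >= 5),
    }

print(time_features(0))  # 1970-01-01 00:00 UTC, a Thursday
```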

Explore Different Regularizations

L1 and L2 regularizations are not the only regularizations that are out there.  Explore the different kinds and perhaps look at different regularizations per layer.

Random Initialization works Surprisingly Well

There are multiple techniques for initializing your network prior to training.  In fact, you can get very far just training the last layer of a network with the previous layers left mostly random.  Consider using this technique to speed up your hyper-parameter tuning explorations.
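A sketch of the "mostly random" idea on a toy regression task: freeze a random hidden layer and fit only the linear readout in closed form. This is an extreme-learning-machine-style baseline under invented toy data, not a full deep network:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data: y = sin(x) on [-3, 3].
X = np.linspace(-3, 3, 200).reshape(-1, 1)
y = np.sin(X).ravel()

# Frozen random hidden layer: random projection plus tanh nonlinearity.
W = rng.normal(size=(1, 50))
b = rng.normal(size=50)
H = np.tanh(X @ W + b)

# Train only the last layer, via closed-form least squares.
w_out, *_ = np.linalg.lstsq(H, y, rcond=None)

mse = np.mean((H @ w_out - y) ** 2)
print(f"train MSE with random features: {mse:.6f}")
```

Even with the hidden weights never trained, the readout alone fits this curve closely, which is why random features make a cheap baseline during hyper-parameter sweeps.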

End-to-End Deep Learning is a Hail Mary Play

A lot of researchers love to explore end-to-end Deep Learning.  Unfortunately, the most effective use of Deep Learning has been to couple it with other techniques.  AlphaGo would not have been successful if Monte Carlo Tree Search had not been employed.

Resist the Urge to Distribute

If you can, try to avoid using multiple machines (with the exception of hyper-parameter tuning).  Training on a single machine is the most cost effective way to proceed.

Convolution Networks work pretty well even beyond Images

Convolution Networks are clearly the most successful kind of network in the Deep Learning space.  However, ConvNets are not only for images; you can use them for other kinds of features (e.g. voice, time series, text).

That’s all I have for now.  There are certainly many more lessons.  Let me know if you stumble on others.

You can find more details of these individual lessons at http://www.deeplearningpatterns.com

A Development Methodology for Deep Learning


The practice of software development has created development methodologies such as agile development and the lean methodology to tackle the complexity of development, with the objective of improving the quality and efficiency of software creation. Although Deep Learning is built from software, it is a different kind of software, and therefore a different kind of methodology is needed. Deep Learning differs most from traditional software development in that a substantial portion of the process involves the machine learning how to achieve objectives. The developer is not completely out of the equation, but works in concert with the machine, tweaking the Deep Learning algorithm.

Deep Learning is a sufficiently rich and complex subject that a process model or methodology is required to guide a developer. The methodology addresses the necessary interplay between the need for more training data and the exploration of alternative Deep Learning patterns that drive the discovery of an effective architecture. The methodology is depicted as follows:

We begin with some initial definition of the kind of architecture we wish to train. This will of course be driven by the nature of the data we are training from and the kind of prediction we seek. The latter is guided by Explanatory Patterns and the former by Feature Patterns. There are a variety of ways to optimize the training process; these are guided by the Learning Patterns.

After selecting our network model and the data we plan to train on, the developer is then tasked with answering whether an adequate labeled training set is available. This process goes beyond the conventional machine learning process, which divides the dataset into three sets: a training set, a validation set and a test set. In the first step of the process, if the training error remains high there are several options that can be pursued: the first is to increase the size of the model; a second is to train a bit longer (or alternatively perform hyper-parameter tuning); and if all fails the developer tweaks the architecture or attempts a new one. In the second step of the process, the developer validates the training against the validation set; if the error rate is high, indicating overfitting, then the options are to find more data, apply different regularizations, and if all fails attempt another architecture. The observation here that differs from conventional machine learning is that Deep Learning has more flexibility, in that a developer has the additional options of employing either a bigger model or more data. One of the hallmarks of Deep Learning is its scalability in performing well when trained with large data sets.

Trying a larger model is something that a developer has control over; unfortunately, finding more data poses a more difficult problem. To satisfy this need for more data one can leverage data from different contexts. In addition, one can employ data synthesis and data augmentation to increase the size of the training data. These approaches however lead to domain adaptation issues, so a slight change in the traditional machine learning development model is called for. In this extended approach, the validation and test sets are required to belong to the same target context, and to validate training from the heterogeneous training set, another set called the training-validation set is set aside to act as additional validation. This basic process model, inspired by a talk by Andrew Ng, serves as good scaffolding on which to hang the many different patterns that we find in Deep Learning.
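The extended split might look like the following sketch (the "web"/"app" contexts and the set sizes are invented for illustration):

```python
import random

random.seed(0)
web_data = [("web", i) for i in range(10000)]  # plentiful, different context
app_data = [("app", i) for i in range(1000)]   # scarce, target context

random.shuffle(web_data)
random.shuffle(app_data)

train     = web_data[:9000] + app_data[:500]  # mixed-context training set
train_dev = web_data[9000:]                   # same distribution as training
dev       = app_data[500:750]                 # target distribution
test      = app_data[750:]                    # target distribution

# A large train -> train-dev error gap suggests overfitting (variance);
# a large train-dev -> dev gap suggests a mismatch between the contexts.
print(len(train), len(train_dev), len(dev), len(test))
```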

As you can see, there are many paths of exploration and many alternative models that may be explored to reach a solution. Furthermore, there is sufficient modularity in Deep Learning that we may compose solutions from other previously developed solutions. Autoencoders, neural embeddings, transfer learning and bootstrapping with pre-trained networks are some of the tools that provide potential shortcuts, reducing the need to train from scratch.

This is by no means complete, but it is definitely a good starting point to guide the development of this entirely new kind of architecture.