Re-posted from The Huffington Post – extended version.
“But this technology has been around for years now!” That’s the refrain I commonly hear these days when talking to more seasoned AI practitioners about the resurgence of AI and machine learning. The origins of many of the scaled machine learning techniques in vogue today can be traced all the way back to the advent of computing in the late forties. Much progress was made in developing techniques such as Artificial Neural Networks (ANNs) in the eighties and early nineties, but the applications did not scale beyond toy problems and proofs of concept. Breakthroughs in scaling ANNs have now led to a resurgence of interest and much wider application of these techniques, now better known as Deep Learning.
I was chatting with a fellow presenter on the sidelines of a Deep Learning conference recently, and he described an interesting debate he had had with some of his peers at Stanford: what other technologies that over-promised and under-delivered twenty years ago are prime for a resurgence now? Their consensus, I was delighted to find out, was Evolutionary Algorithms.
Like ANNs, Evolutionary Algorithms (EAs) are techniques inspired by models of biological intelligence, applied to real-world problem solving. In the case of EAs, the approach is population-based. Here’s how it works: A model of the desired outcome is used for producing candidate solutions. For example, let’s say we are given ten numbers and the task is to order them from smallest to largest:
2, 24, 70, 1, 19, 23, 27, 66, 28, 38
The model for the desired outcome is a list of size 10 that includes all 10 numbers. In other words, as long as we have a full list of the ten numbers, in whatever order they happen to be, we have a candidate.
The system then generates a population of such candidates, all of which comply with the model, but that differ from one another. To generate this population, the easiest thing to do is to throw dice, and generate the candidates randomly:
1, 70, 2, 24, 66, 38, 27, 23, 28, 19
38, 27, 19, 2, 28, 1, 70, 66, 23, 24
24, 1, 23, 66, 2, 19, 70, 38, 27, 28
70, 24, 38, 28, 66, 23, 19, 1, 2, 27
Well, obviously none of the above lists are our solution, because none are ordered from smallest to largest. However, some are closer to the solution than others. In other words, some seem to be more ordered. One way to measure how ‘ordered’ the lists are with respect to one another is to go through each list and count how many times we run into a number that is not greater than the prior number. In the first example above, 70 is greater than 1, so that’s good, but 2 is not greater than 70, and we can count that as a point against that list being completely ordered. This count for the first list is 5, but for the second list it is 6, and for the third one, it is 4.
This method of measuring the relative distance of the candidates from the solution is called a fitness function, and it helps the system rank the candidates in the population from ‘fittest’ to ‘least fit’ to be the solution. In our little example above, this translates to the third list, with a fitness of 4, being more ordered than the second list, with a fitness of 6.
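To make this concrete, here is a minimal sketch of such a fitness function in Python. The helper name `fitness` is my own choice; the three lists are the examples above, and the counts match the ones worked out in the text:

```python
def fitness(candidate):
    """Count adjacent pairs that are out of order; lower is fitter, 0 is sorted."""
    return sum(1 for a, b in zip(candidate, candidate[1:]) if b <= a)

print(fitness([1, 70, 2, 24, 66, 38, 27, 23, 28, 19]))  # 5
print(fitness([38, 27, 19, 2, 28, 1, 70, 66, 23, 24]))  # 6
print(fitness([24, 1, 23, 66, 2, 19, 70, 38, 27, 28]))  # 4
```

Note that a fully ordered list scores 0, so the search becomes a minimization problem.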
That’s pretty much all you need to get a very basic population-based method to work: a model for the solution and a fitness function. In our example, you simply generate as many candidates as possible, always keeping the one with the best fitness so far. With a fitness function that does a decent job of distinguishing between candidates, a randomized enough generator that produces a suitably diverse population of model-compliant candidates, and enough iterations, you will get closer and closer to the solution.
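The basic keep-the-best loop can be sketched in a few lines. Everything here is illustrative: the iteration count is arbitrary, and pure random search will usually not find the exact solution, but the best-so-far fitness drops quickly:

```python
import random

def fitness(candidate):
    # Number of adjacent pairs out of order; 0 means fully sorted.
    return sum(1 for a, b in zip(candidate, candidate[1:]) if b <= a)

def random_search(numbers, iterations=100_000, seed=0):
    rng = random.Random(seed)
    best, best_f = numbers[:], fitness(numbers)
    for _ in range(iterations):
        candidate = numbers[:]
        rng.shuffle(candidate)          # a fresh random candidate that fits the model
        f = fitness(candidate)
        if f < best_f:                  # always keep the fittest seen so far
            best, best_f = candidate, f
    return best

numbers = [2, 24, 70, 1, 19, 23, 27, 66, 28, 38]
best = random_search(numbers)
print(best, fitness(best))
```

With ten numbers there are 10! (about 3.6 million) possible orderings, so blind sampling already shows its limits: getting close is cheap, but hitting fitness 0 exactly is a matter of luck, which is where the directed operators below come in.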
The most famous type of EA is the Genetic Algorithm (GA), first introduced by John Holland in the mid-seventies. Holland introduced the notion of directing the random generation of candidates so as to give the fitter candidates from a preceding population a larger role in the generation of new candidates. In our example above, for instance, rather than generating a new candidate completely randomly every time, we would occasionally take one of the fitter lists, say the third one:
24, 1, 23, 66, 2, 19, 70, 38, 27, 28
and make a minor change to it – like swapping two of the numbers at random. This is known as the mutation operator, and it can cut both ways. If we are lucky, and the system happens to pick two numbers to swap that were in the wrong order (e.g., 70 and 38), then the resulting ‘offspring’ candidate will have a better fitness (in this case, it drops from 4 to 3):
24, 1, 23, 66, 2, 19, 38, 70, 27, 28
As in nature, however, most mutations are not beneficial. The reason for this is that a candidate that is already quite fit is more likely to become less fit if you mess with it. Having said this, mutations are quite useful, because on the rare occasions when they do improve a candidate’s fitness, they can yield new global optima. They also help diversify the population of the fittest candidates while preserving most of their traits.
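A swap-mutation operator is easy to sketch. The names are my own; the parent is the fit list from the text, and applying the operator repeatedly shows the asymmetry described above, with most offspring less fit than the parent:

```python
import random

def fitness(candidate):
    # Adjacent pairs out of order; lower is fitter.
    return sum(1 for a, b in zip(candidate, candidate[1:]) if b <= a)

def mutate(candidate, rng):
    """Swap-mutation: exchange two randomly chosen positions."""
    child = candidate[:]
    i, j = rng.sample(range(len(child)), 2)   # two distinct positions
    child[i], child[j] = child[j], child[i]
    return child

rng = random.Random(1)
parent = [24, 1, 23, 66, 2, 19, 70, 38, 27, 28]   # fitness 4
offspring = [mutate(parent, rng) for _ in range(30)]
# Most swaps make things worse; only a few make them better:
print(sorted(fitness(c) for c in offspring))
```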
Another reproductive operation popularized by Holland is crossover. Here, two ‘parent’ candidates are selected from among the fitter candidates in the population and merged into two or more offspring – the idea being that at least one of them will turn out to be fitter than either parent. In our example, we might pick, say, the first and third lists as parents, copy a few numbers from the first parent into the offspring, and then complete the offspring’s list with the remaining numbers in the order they appear in the second parent.
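One simple way to realize this for lists of distinct numbers is to copy a head slice from one parent and fill in the rest from the other; the slice choice and names here are my own, and many other crossover variants exist:

```python
import random

def crossover(p1, p2, rng):
    """Copy a random-length head from the first parent, then append the
    remaining numbers in the order they appear in the second parent."""
    cut = rng.randrange(1, len(p1))
    head = p1[:cut]
    return head + [x for x in p2 if x not in head]

rng = random.Random(7)
p1 = [1, 70, 2, 24, 66, 38, 27, 23, 28, 19]   # the first list above
p2 = [24, 1, 23, 66, 2, 19, 70, 38, 27, 28]   # the third list above
child1 = crossover(p1, p2, rng)
child2 = crossover(p2, p1, rng)               # swap the roles for a second child
print(child1)
print(child2)
```

Because every number appears exactly once in each parent, the fill step guarantees each offspring is still a valid candidate, i.e., a full list of all ten numbers.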
The first thing that struck me, and I suspect many other practitioners, as fascinating about evolutionary algorithms, is the fact that they often work, and sometimes produce unexpected results – a delight that seems to be the hallmark of AI systems. Like ANNs, these simplified simulations seem to prove out our models of how nature operates, and, I think, this was the main impetus and drive behind the initial adoption of these methods.
Like ANNs, the use of EAs hit a slump once people realized that they do not scale as easily to solve large real-world problems, and simpler methods, mostly grounded in mathematics, such as Support Vector Machines, prevailed.
Scaling a Machine Learning algorithm requires the ability to run it on many machines/processors, but that is not the only requirement: the algorithm must also make productive use of the distributed processing capacity. On the face of it, an evolutionary algorithm is an example of what computer scientists refer to as an embarrassingly parallel workload. After all, an evolutionary system comprises a population of individual candidates, each of which can be evaluated independently.
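This independence is easy to see in code. A minimal sketch, using a thread pool for brevity (a real system would farm the evaluations out to separate processes or machines):

```python
import random
from concurrent.futures import ThreadPoolExecutor

def fitness(candidate):
    return sum(1 for a, b in zip(candidate, candidate[1:]) if b <= a)

rng = random.Random(0)
numbers = [2, 24, 70, 1, 19, 23, 27, 66, 28, 38]
population = []
for _ in range(1000):
    c = numbers[:]
    rng.shuffle(c)
    population.append(c)

# Every candidate is scored independently of the others, so evaluation
# maps cleanly onto a pool of workers.
with ThreadPoolExecutor(max_workers=4) as pool:
    scores = list(pool.map(fitness, population))

best_f, best = min(zip(scores, population))
print(best_f, best)
```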
John Koza, a student of Holland’s, and the father of Genetic Programming (GP), a form of EA commonly used for symbolic regression, was one of the first to exploit the parallelizability of EAs in order to scale his system. He used a Beowulf cluster of 1000 CPUs to run GP optimizations for, amongst other things, automatic circuit design. The results were quite impressive: his system was able to reinvent 15 logic circuits patented between 1917 and 2001, as well as 9 other results, some of which may well be patentable (including algorithms for quantum computing).
EAs are also quite suitable for multi-objective problems, where there is more than one fitness function. A car, for example, can be measured on two objectives: its mpg and its performance. Optimizing for either one alone might not be desirable, but there may be a whole range of acceptable candidates when we consider the two objectives in combination. This implies that we are not optimizing for a single solution, but for a population of solutions, all of which might be acceptable.
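The standard way to compare candidates under multiple objectives is Pareto dominance: one candidate dominates another if it is at least as good on every objective and strictly better on at least one. The car names and scores below are made up purely for illustration:

```python
def dominates(a, b):
    """a dominates b if a is at least as good on every objective
    (higher is better here) and strictly better on at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

# Hypothetical cars scored on (mpg, performance):
cars = {"A": (40, 3), "B": (30, 7), "C": (25, 6), "D": (20, 9), "E": (35, 2)}

# The Pareto front: candidates no other candidate dominates.
front = [name for name, score in cars.items()
         if not any(dominates(other, score) for other in cars.values())]
print(front)
```

Here C is dominated by B and E by A, so the acceptable population is the front {A, B, D}: each trades mpg for performance differently, and none is strictly worse than another.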
In his 2003 book, Milestones in Computer Science and Information Technology, Edwin D. Reilly noted: “genetic programming holds the greatest promise for eventual development of a computer that can pass the Turing test. And this would be the very way that Turing himself envisioned that it might happen”. In fact, Koza inspired an annual competition of his own, called the Humies, for Human-Competitive Results Produced by Genetic and Evolutionary Computation.
Recent breakthroughs in scaling EAs go beyond distributing the evaluation of the candidates. In these systems, each processing node maintains and evolves its own population of candidates independently, occasionally exchanging genetic material with other nodes. These systems can also operate on large data sets by incrementally evaluating the candidates on subsets of the data, adjusting and improving their notion of the fitness of each promising candidate as it is run on more and more data samples. For example, let’s say the candidate is an electronic circuit that needs to be validated on many different combinations of inputs. Rather than running every single candidate on every single possible input combination, the system runs it on a subset to start with. If it fares well enough, compared to other candidates, then it can be kept around and evaluated on more input combinations. These breakthroughs have allowed for systems of massive scale running on millions of processing nodes, tackling much more ambitious real-world problems.
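The circuit-validation idea can be sketched in miniature. Everything here is illustrative and my own construction, not any particular system: the target is a 3-input majority function, the candidate “circuits” are plain Python functions, and `slack` controls how far behind the leader a candidate may fall before it stops receiving evaluations:

```python
import random

# Hypothetical target behaviour: a 3-input majority "circuit".
def majority(a, b, c):
    return (a + b + c) >= 2

all_inputs = [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]

# Candidate "circuits", modelled as functions for illustration:
candidates = {
    "and3":  lambda a, b, c: a and b and c,
    "or3":   lambda a, b, c: a or b or c,
    "maj":   lambda a, b, c: (a + b + c) >= 2,
    "first": lambda a, b, c: a,
}

def incremental_best(candidates, cases, batch_size=2, slack=1):
    """Evaluate candidates on growing subsets of the test cases, pruning
    any candidate that falls more than `slack` behind the current leader,
    so weak candidates never see the full data set."""
    alive = dict(candidates)
    scores = {name: 0 for name in alive}
    for start in range(0, len(cases), batch_size):
        batch = cases[start:start + batch_size]
        for name, circuit in alive.items():
            scores[name] += sum(
                int(bool(circuit(*t)) == bool(majority(*t))) for t in batch)
        top = max(scores[n] for n in alive)
        alive = {n: c for n, c in alive.items() if scores[n] >= top - slack}
    return max(alive, key=lambda n: scores[n])

cases = all_inputs[:]
random.Random(0).shuffle(cases)
print(incremental_best(candidates, cases))
```

Only the correct candidate can keep pace with the leader on every batch, so it survives to be tested on all eight input combinations while the others are cut off early.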
There is, however, a fundamental difference between EAs and biological evolution. In EA, the unit of evolution is a candidate solution, which is evaluated independent of other individuals. In biological evolution, individuals evolve in an environment consisting of other individuals, with specializations and interactions bringing about emergent behaviors at various levels of cooperation and competition. This makes for a very complex system indeed. In fact, this was what Holland had in mind in the first place, and GAs were a simplification of the complex systems he researched.
Artificial Life (ALife) is a branch of AI dealing with the understanding and harnessing of complex systems. The individuals in ALife were originally finite automatons with predefined behavior, which led to interesting chaotic emergent patterns of behavior (e.g., Conway’s Game of Life), but the field has grown to systems that evolve individuals with a low-level set of actions, interacting in a common environment (e.g., Avida).
Imagine, for instance, a set of possible actions, including reading two numbers from a list, comparing two numbers, swapping the order of two numbers, repeating an action, and running an action depending on the outcome of another action. Now let us generate a population of individuals, each capable of applying one or more of these actions, and let us say that their decision to apply an action is informed by a mapping of their sense of their environment, including the state of other individuals, to the choice of action. Now we somehow input the list of numbers we would like to order into the system, allowing the individuals to sense them, should they have the actions to do so. We let the individuals loose on the data, and regularly check on the state. The fitness of the system is reflected in how ordered the numbers are. This fitness is a result of the individuals in the system collaborating or competing in their respective, relative, and related environments. The key question here is how the global fitness of the system maps to rewards or penalties for the individuals. This is called the fitness attribution problem. There are solutions for this in certain limited application areas, such as distributed constraint satisfaction problems.
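A toy sketch of this idea (my own construction, not any published ALife system): each individual can only sense two adjacent cells of a shared environment and swap them if they are out of order. No agent knows the global goal, yet the global fitness, the orderedness of the list, emerges from their local actions:

```python
import random

def run_swarm(numbers, agents=5, steps=2000, seed=0):
    """Agents wander over a shared list, each applying one local action:
    compare two adjacent cells and swap them if they are out of order."""
    rng = random.Random(seed)
    env = numbers[:]
    positions = [rng.randrange(len(env) - 1) for _ in range(agents)]
    for _ in range(steps):
        for i, pos in enumerate(positions):
            if env[pos] > env[pos + 1]:
                env[pos], env[pos + 1] = env[pos + 1], env[pos]  # local action
            positions[i] = rng.randrange(len(env) - 1)           # wander
    return env

print(run_swarm([2, 24, 70, 1, 19, 23, 27, 66, 28, 38]))
```

Given enough steps the list ends up sorted, but note that this sketch sidesteps the fitness attribution problem entirely: no individual is rewarded or penalized, which is precisely the hard part in the general case.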
In recent years, inspired by the seminal work of Mark Bedau in 1997, showing a lack of Evolutionary Activity in ALife systems compared to biological evolution, there has been a back-to-basics rethinking of the design of ALife systems, focusing on systems that are capable of learning to learn. In other words, these systems allow the program code (i.e., the computational DNA) to be modifiable by an individual’s executing program. Surprisingly, this rethinking seems to have taken us full circle to the bio-inspired finite automata designs of none other than Von Neumann himself, published posthumously in 1966.
One of the most interesting and inspiring works I have run into recently is that of Dave Ackley at the University of New Mexico. His model for robust-first computing advocates massive distribution of an ALife system for problem solving, with no single point of failure. One of the applications he demonstrates in his proof-of-concept simulation is, in fact, that of ordering a stream of numbers. The numbers are fed into the system in a distributed manner, sorted in a distributed manner, and even output over multiple nodes, so that the system is resilient to damage to the underlying infrastructure.
So, are EAs primed for a resurgence now? It is quite possible that, by virtue of their distributability, EAs will emerge as the next big thing in AI, allowing us to tackle, in a reasonable amount of time, more ambitious problems over the ever-expanding data sets being collected.
Babak Hodjat, Co-founder and chief scientist