I recently came across a wonderful bird’s eye-view paper in JMLR [1]. It is helping me to clarify my views about the relationship between Machine Learning and Evolutionary Computation.
Bennett and Parrado-Hernandez remark:
“Optimization lies at the heart of machine learning. Most machine learning problems reduce to optimization problems. Consider the machine learning analyst in action solving a problem for some set of data. The modeler formulates the problem by selecting an appropriate family of models and massages the data into a format amenable to modeling. Then the model is typically trained by solving a core optimization problem that optimizes the variables or parameters of the model with respect to the selected loss function and possibly some regularization function. In the process of model selection and validation, the core optimization problem may be solved many times. The research area of mathematical programming theory intersects with machine learning through these core optimization problems”
and later,
“Convexity plays a key role in mathematical programming [MP]. Convex programs minimize convex optimization functions subject to convex constraints ensuring that every local [optimum] is always a global [optimum]. In general, convex problems are much more tractable algorithmically and theoretically. … General nonconvex programs are NP-hard”
This paper lends support to my sense that the optimization techniques used in machine learning are largely limited to convex optimization algorithms. Successes in machine learning have so far been a result of the application of human ingenuity to carry out effective semi-principled reductions of certain classes of problems to convex optimization problems. I use the qualifier semi-principled because there is usually some solid rationale behind an ML reduction that leads one to expect that the reduction makes the original problem amenable to solution by convex optimization techniques. This however is never formally proven. i.e. the reduction is not formal in the sense that the traveling salesman problem formally reduces to SAT. In all likelihood the optimization problems that the redutions yield are non-convex. However in practice, the application of convex optimiaztion techniques to these optimization problems often gives satisfactory solutions to the original problems. Hence the qualifer effective.
For example researchers in the support vector machine community use the kernel trick in an effective semi-principle reduction of classification and regression problems to a class of convex optimization problems called quadratic programming problems. These problems are then solved using quadratic programing algorithms such as Platt’s SMO algorthm.
To summarize the discussion so far: convex optimization problems are a class of optimization problems in which every local optimum is also a global optimum. Many useful algorithms for efficiently solving convex optimization problems have been constructed by the mathematical programming community. One of the reasons for the success of Machine Learning techniques is the use of human ingenuity to achieve effective semi-principled reductions of certain classes of problems to convex optimization problems for which there exist efficient solvers within MP.
Another reason, as authors mention, is that machine learning researchers have been tweaking the convex optimization algorithms of mathematical programming to create new algorithms which are better suited for their purposes. “Mathematical Programming puts a premium on accuracy, speed, and robustness. Since generalization is the bottom line in machine learning and training is normally done off-line, accuracy and small speed improvements are of little concern in machine learning. Machine learning prefers simpler algorithms that work in reasonable computational time for specific classes of problems” [1]. The paper later mentions that optimization is often cut short (!) in Machine Learning to prevent overfitting This “technique” is called early-stopping. And later ‘Thus not only is “good” optimization not necessary, but “bad” optimization algorithms can lead to better machine learning models’.
Here are some reformulations of this paper and some of my thoughts.
Let me differentiate between optimization and adaptation. I’ll define optimization as the procurement of one or more points of optimal or close-to-optimal value, and adaptation as the generation of points of increasing value over time. Things become quite clear in light of this distinction. Mathematical Programming seems to be almost completely concerned with optimization whereas machine learning seems to require relatively quick adaptation.
The love affair between mathematical programming and Machine learning exists because MP provides the world with a set of crisp, well-defined problems — e.g. a linear programming problem, a quadratic programming problem, a second-order cone programming problem, a semdefinite programming problem, a semi-infinite programming problem (see the appendix in [1] for a description of each of these convex optimization problems) — and then provides programmatic techniques for solving those problems. To state this using metaphors from software engineering (particularly object oriented programming), these well-defined problems are the interfaces that MP publishes, and the efficient algorithmic solvers within that field implement these interfaces.
The interface-problems published by the MP community give ML researchers a rough target to hit in their efforts to carry out effective semi-principled reductions of difficult problems to ones for which efficient solvers exist. The interface-problems of MP are similar to the optimization problems that ML researchers ultimately end up with, but there are important differences: for example, MPers seek fast optimization whereas MLers merely want a high degree of adaptation, and have additional requirements such as scalability. Nevertheless, if a ML researcher works out a semi-principled reduction of a class problems to one of MP’s interface-problems, there are off-the-shelf algorithms within MP which allow her to quickly test whether her reduction is effective. MLers put great stock in the fact that each convex optimization algorithm in the MP pantheon has robust performance over its problem class and is theoretically proven to converge to the optimum.
Machine learning researchers spend their time trying to achieve effective reductions of difficult problems to problem classes for which efficient and robust solvers exist. An effective reduction may be a multistep affair (effectively reduce problem A to problem B, effectively reduce problem B to problem C, ….) but it most often bottoms out in an effective reduction to an optimization problem, most often a convex optimization problem. Different techniques and tricks may be used at each step of an effective reduction (e.g. probabilistic models, the kernel trick, etc.).
Convex optimization problems are rather easy as optimization problems go. Therefore a large amount of human ingenuity is required to carry out effective semi-principled reductions of classes of real world problems to convex optimization problems. These reductions often involves the use of “heavy machinary” from computer science, mathematics and statistics.
It seems to me that fundamental advances by the machine learning community are currently being limited by two factors (besides computational speed). The first is the sheer level of human ingenuity that is required to realize novel reductions of real world problems to convex optimization problems. The second is that the targets of these reductions have so far been limited to convex-optimization problems. The first factor cannot be changed. With regard to the second, I believe that the field of evolutionary computation may have a lot to contribute . Selecto-recombinative evolutionary algorithms are widely believed to perform efficient adaptation even on non-convex problems. Unfortunately there are currently no crisp descriptions of the classes problems that such algorithms can efficiently ‘solve’. I believe that if and when such interface-problems are determined the ML community will take much greater interest in Evolutionary Computation. EC might then play the same role w.r.t. ML that MP currently plays, i.e. ML researchers might then seek effective semi-principled reductions of real-world problems to interface-problems in the EC pantheon.
References:
[1] Bennett, K.P., Parrado-Hernandez, E. The interplay of optimization and machine learning research, Journal of Machine Learning Research 7 (2006) 1265-1281