Variance Based Selection to Improve Test Set Performance in Genetic Programming


Abstract

This paper proposes to improve the performance of Genetic Programming (GP) over unseen data by minimising the variance of the output values of evolving models, in addition to reducing error on the training data. Variance is a well-understood, simple and inexpensive statistical measure; it is easy to integrate into a GP implementation and can be computed over arbitrary input values even when the target output is not known.
Moreover, we propose a simple variance-based selection scheme to decide between two models (individuals). The scheme is simple because, although it uses bi-objective criteria to differentiate between two competing models, it does not rely on a multi-objective optimisation algorithm. In fact, standard multi-objective algorithms can also employ this scheme to identify good trade-offs, such as those located around the knee of the Pareto front.
The results indicate that, despite some limitations, these proposals significantly improve the performance of GP over a selection of high-dimensional (multivariate) problems from the domain of symbolic regression. This improvement is manifested by superior results over test sets in three out of four problems, and by the fact that performance over the test sets does not degrade, as is often witnessed with standard GP, nor is it ever inferior to performance on the training set. As with some earlier studies, these results do not find a link between expressions of small size and their ability to generalise to unseen data.
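
As an illustration of the point that output variance can be measured without target values, the following is a minimal sketch in Python. It is not the implementation used in this paper: the function name output_variance, the two example expressions and the sample inputs are all hypothetical, and population variance is assumed as the measure. The sketch covers only the variance computation, not the selection scheme.

```python
from statistics import pvariance


def output_variance(model, inputs):
    """Population variance of a model's outputs over the given input rows.

    No target values are needed, so the inputs may be unlabelled.
    (Hypothetical helper for illustration only.)
    """
    outputs = [float(model(row)) for row in inputs]
    return pvariance(outputs)


# Two hypothetical evolved expressions over two input variables.
def smooth(row):
    return row[0] + row[1]


def wild(row):
    return 10.0 * row[0] - 10.0 * row[1]


# Arbitrary input points; no target output is attached to them.
unlabelled_inputs = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]

variance_smooth = output_variance(smooth, unlabelled_inputs)
variance_wild = output_variance(wild, unlabelled_inputs)
```

Under these assumptions, the second expression produces a far larger output variance than the first on the same inputs, which is the kind of signal a variance-aware comparison between two individuals could use alongside training error.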