Evolutionary Computing for Feature Selection and Predictive Data Mining

Abstract

Feature selection has recently been the subject of intensive research in data mining, especially for datasets with a large number of descriptive attributes, such as QSAR (Quantitative Structure-Activity Relationship) data. QSAR is an in silico drug design methodology that requires identifying the molecular features that explain a drug-relevant activity of interest. A typical QSAR dataset for predicting an activity of interest is characterized by a large number of descriptive features (300 to 1000) for a relatively small number of compounds (typically around 50 to 500). Finding the best feature subset for a problem with N features by exhaustive search requires evaluating all 2^N possible subsets, which is infeasible at this scale. The best feature subset also depends on the predictive modeling method that will be used to predict future unknown values of the response variables of interest. Feature selection involves minimizing the number of selected features while maximizing the predictive power of the model. From this point of view, feature selection can be treated as a special type of multi-objective optimization problem. Evolutionary computing can be applied to problems where traditional methods are hard to apply or lead to unsatisfactory solutions (e.g., convergence to local optima). The methods of evolutionary computation are stochastic, and their search strategies imitate and model two phenomena from nature and evolution: (i) survival of the fittest and (ii) genetic inheritance. This dissertation addresses evolutionary algorithms for feature selection and predictive modeling for QSAR datasets.
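As an illustration of the idea only, and not the algorithms developed in this dissertation, the following minimal Python sketch shows how a genetic algorithm might search the space of 2^N feature subsets. Everything in it is an assumption made for the example: the synthetic data from the hypothetical helper make_toy_data, the leave-one-out 1-nearest-neighbour error used as a stand-in predictive model, the penalty weight that folds the two objectives (few features, low error) into a single score, and all parameter values.

```python
import random

def make_toy_data(n_samples=30, n_features=12, seed=0):
    """Synthetic stand-in for a QSAR table (assumption): only features 0 and 3 matter."""
    rng = random.Random(seed)
    X = [[rng.gauss(0, 1) for _ in range(n_features)] for _ in range(n_samples)]
    y = [row[0] + row[3] + rng.gauss(0, 0.1) for row in X]
    return X, y

def loo_1nn_mse(X, y, mask):
    """Leave-one-out squared error of a 1-nearest-neighbour predictor on the masked features."""
    cols = [j for j, bit in enumerate(mask) if bit]
    if not cols:
        return float("inf")  # an empty subset cannot predict anything
    total = 0.0
    for i, xi in enumerate(X):
        best_d, best_y = float("inf"), 0.0
        for k, xk in enumerate(X):
            if k == i:
                continue
            d = sum((xi[j] - xk[j]) ** 2 for j in cols)
            if d < best_d:
                best_d, best_y = d, y[k]
        total += (y[i] - best_y) ** 2
    return total / len(X)

def fitness(X, y, mask, penalty=0.01):
    """Weighted-sum scalarization (assumption): low error and few features both score higher."""
    return -loo_1nn_mse(X, y, mask) - penalty * sum(mask)

def select_features(X, y, pop_size=20, generations=20, p_mut=0.05, seed=1):
    """Tiny genetic algorithm over bit masks; returns the best mask found."""
    rng = random.Random(seed)
    n = len(X[0])
    cache = {}

    def score(mask):
        key = tuple(mask)
        if key not in cache:
            cache[key] = fitness(X, y, mask)
        return cache[key]

    # Start from the all-features baseline plus random subsets.
    pop = [[1] * n] + [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size - 1)]
    for _ in range(generations):
        next_pop = [max(pop, key=score)[:]]  # elitism: keep the current best unchanged
        while len(next_pop) < pop_size:
            # Survival of the fittest: tournament selection of two parents.
            p1 = max(rng.sample(pop, 3), key=score)
            p2 = max(rng.sample(pop, 3), key=score)
            # Genetic inheritance: uniform crossover followed by bit-flip mutation.
            child = [a if rng.random() < 0.5 else b for a, b in zip(p1, p2)]
            child = [1 - bit if rng.random() < p_mut else bit for bit in child]
            next_pop.append(child)
        pop = next_pop
    return max(pop, key=score)

X, y = make_toy_data()
best = select_features(X, y)
print("selected feature indices:", [j for j, bit in enumerate(best) if bit])
print("subsets in the full search space:", 2 ** len(best))
```

Because the best individual is carried over unchanged each generation and the initial population contains the all-features mask, the returned subset never scores worse than using every feature. A true multi-objective treatment would keep a set of non-dominated solutions rather than a single weighted score; the weighted sum is used here only to keep the sketch short.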