page 1  (10 pages)
2to next section

Decision Tree Induction Using Genetic

Programming

G?okhan T?ur and Halil Altay G?uvenir
Department of Computer Engineering and Information Science
Faculty of Engineering, Bilkent University, 06533 Bilkent, Ankara, Turkey
ftur,[email protected]

Abstract. A decision tree induction algorithm using genetic programming (GP) is presented. The best decision tree is defined as the one which achieves maximum accuracy with minimum number of internal nodes. In this approach every individual is a decision tree candidate. The results are satisfactory in the sense that it can find the optimum solution, i.e. the best decision tree, for a small sized dataset. For the dataset related to American Congress, this algorithm can reach an accuracy of 97.3%. Solutions to unknown or noise data are also proposed in this framework.

Keywords. Machine Learning, Genetic Programming, Decision Trees, Genetic Algorithms, Artificial Intelligence

1 Introduction

Genetic algorithms (GA), by combining the survival of the fittest among string structures, (called individuals) with a randomized genetic information exchange, try to form a search algorithm similar to the evolution process in the nature. Algorithm begins with a population of randomly generated individuals, and the fitness" of each individual is measured. The fittest individuals in the first generation have a higher number of offsprings in the second generation. GA is covered in detail in [2] and [3].

Genetic programming (GP) is a method of Adaptive Automatic Program Induction", originally created by John Koza and James Rice of Stanford University [4]. GP is similar to GA, except that, individuals are no longer strings, but parse trees of the programs. The programs are composed of elements from a function-set and a terminal-set, which are typically fixed sets of symbols selected to be appropriate for the solution of problems in the domain of interest. Initial population is created randomly. The genetic information exchange is done by taking randomly selected subtrees in the individual programs and exchanging them. This corresponds to the crossover operation in GA. Because of the