Multitask Constructive Induction

Peter Drake

Computer Science Department
Indiana University
Bloomington, IN 47405
pedrake@cs.indiana.edu

Abstract

The difficulty of a supervised learning problem can depend greatly on the way the training data are encoded. When the structure of the problem is not known, the data must be encoded with primitive, atomic features. Constructive induction, the automatic production of higher-level combinations of the atomic features, can greatly facilitate learning.

The FLIP algorithm uses an information gain metric to find useful conjunctions of atomic features and their negations. These features offer improved performance for decision tree learning on individual problems. Furthermore, FLIP can be used to produce conjunctions useful over a class of problems. The features so produced are shown to aid learning, both for the problems to which FLIP had access, and for new problems in the same class.

Introduction

One difficulty in inductive learning is that the concept being learned is not always easily expressed in the language used to describe the training examples. The concept of `check' is not a simple one when expressed in terms of the Cartesian coordinates of Chess pieces, and one would not want to describe what a chair looks like in terms of pixel brightnesses. Nonetheless, machine learning programs are routinely asked to learn concepts from just such low-level descriptions, because high-level descriptions can be difficult or even impossible to acquire.

The process of automatically producing high-level features is called constructive induction (Michalski, 1983). The FLIP algorithm uses an information gain metric to construct conjunctions of atomic, Boolean features and their negations. It has been demonstrated (Drake, 1995) that FLIP offers improvement in single-task learning in a variety of circumstances. This paper focuses on new results showing FLIP's usefulness in learning sets of related tasks.

The following section discusses the information gain metric and constructive induction. The FLIP algorithm is explained in the section thereafter. New experimental results are then presented, followed by a discussion of related work. The final section is a summary of conclusions and areas for future research.

Previous Work

Information Gain

This subsection borrows heavily from (Quinlan, 1993).

Information gain is defined in terms of entropy. Entropy is a measure of the expected amount of information conveyed by an as-yet-unseen message from a known set.

The amount of information conveyed by a message, in bits, is minus the base-two logarithm of the probability of that message. For example, if there are 8 equally probably messages, receiving any one of them conveys -log2(1/8) = 3 bits of information. Less probable messages convey more information, and vice versa. The expected amount of information conveyed by any message is simply the sum over all possible messages, weighted by their probabilities.

In the context of supervised learning, the possible `messages' are the classes into which the data fall, and the probability of a class is the portion of the training cases which are labelled with that class. The entropy of the training set is the expected amount of information conveyed by the label of a case. A training set evenly split across the classes therefore has maximal entropy (1 if there are two classes), while one containing only examples of one class has entropy 0.

If the data can be divided into subsets by some useful test (i.e., examining one of the features), each subset will have less entropy than the whole set. The split entropy for some feature is the sum of the entropies of the subsets resulting from the split, weighted by their sizes as fractions of the size of the original set.

The information gained by splitting on some feature is simply the original entropy minus the split entropy of the feature. In growing a decision tree, the feature offering the greatest information gain is selected at each step.

Constructive Induction

One drawback of decision trees is that trees describing certain functions contain a great deal of replication. This replication problem (Pagallo and Haussler, 1990) can lead to very large (and, with incomplete or noisy data, inaccurate) decision trees.

The replication problem is just one example of feature interaction or blurring (Rendell and Ragavan, 1993). When a feature provides useful information only in the context of other features, the information gain metric may not be able to find good features. The worst case is the parity problem, where the class of an example can be determined if all n atomic features are known, but not if any n - 1 of them are known.

One approach to dealing with feature interaction is constructive induction (Michalski, 1983). Constructive induction refers to the creation of new features which are functions of the atomic features. The hidden units of a backpropagation neural network may be said to perform constructive induction, creating compound features which greatly simplify the functions the output units are required to learn (Rumelhart et al, 1986).

In addition to guiding the learner to relevant combinations of features, constructive induction can provide a vocabulary of high-level features leading to more concise concept descriptions. Internal representations produced by constructive induction can improve generalization, giving better performance in the case of noisy or incomplete data.

Another intriguing possibility is that constructive induction may be used to discover high-level features which are common to a set of related problems. For example, a vision program might learn an internal representation which ignores rotation and absolute brightness, and a medical diagnoser might discover combinations of symptoms which are relevant to several diseases.

FLIP

FLIP, the Feature Learning Inductive Preprocessor, finds compound features from a data set. The features can then be fed to a standard supervised learner, such as a decision tree learner or a neural network.

FLIP uses a beam search to find conjunctions of the atomic features and their negations. When learning features for a single problem, the `goodness' measure is the information gain metric. When learning features for a set of problems, the measure is the average information gain over all of the problems.

In pseudocode, the algorithm is:

  1. Initialize beam to set of literals (the atomic features and their negations), all marked `live'.
  2. Until none of the features in beam are `live':
    1. Mark every feature in beam as `dead'.
    2. Find each new feature which can be made by conjoining a literal to an extant feature. Mark the new features `live'.
    3. Throw out all but the width best features.
  3. Return all features in beam comprising more than one literal.
For example, consider complete data for the function (A^B^C)v(D^E^F). In the first pass through the loop, every two-literal conjunction (A^B, A^~B, et cetera) is constructed. If width is set to 10, the 10 best features survive the first iteration. These features are:
Feature		Information Gain

E^F		0.13
D^F		0.13
D^E		0.13
B^C		0.13
A^C		0.13
A^B		0.13
~C^~F		0.11
~C^~E		0.11
~C^~D		0.11
~B^~F		0.11
The top six are the two-literal subsets of the terms of the function, and the next four are some of the clauses of the conjunctive normal form negation of the function. The next iteration produces the two terms of the function (which have gain 0.31), the six two-literal subsets, and two of the CNF negation clauses. Another iteration finds some spurious four-literal features, and the last iteration finds nothing new, so the algorithm terminates. At the front of the list of features returned are the `ideal' features A^B^C and D^E^F.

Experimental Results

Quinlan's C4.5 program was used as a decision tree learner for these experiments. The beam width was set to 20. There were 30 atomic features (some of which may have been irrelevant). Each training or testing set contained 100 data points, and the training and testing sets were disjoint for each function. Results reported are means over 20 trials.

In each trial, 11 related CNF functions were generated. The functions were related in that they drew their clauses from a common pool. FLIP was used to find features for each of the first n functions, with n varying from 0 through 10. (n = 0 indicates not using FLIP.) For each number of features, the decision tree learner tried to learn both FLIP's first function and the 11th function, which FLIP never saw, using the atomic features plus any generated by FLIP.

In the first experiment, CNF functions of four, three-literal clauses were used, with the clauses drawn from a common pool of 10 clauses. FLIP offered dramatic improvement at the start on both the previously-learned fuction (Figure 1) and on the new function (Figure 2). FLIP continued to offer improvement for n up to 10 on the new function, and up to about 4 for the previously-learned function.

Figure 1: Error rates (mean +/- standard error) for random CNF functions of four, three-literal clauses, drawn from a pool of 10 clauses. The test function was the first function learned. According to a directional Spearman rank-order correlation coefficient test, these results are not significant at the 0.05 level (rs = -0.527).

Figure 2: Error rates (mean +/- standard error) for random CNF functions of four, three-literal clauses, drawn from a pool of 10 clauses. The test function was a newly-generated function in the same class. According to a directional Spearman rank-order correlation coefficient test, these results are significant (rs = -0.864, p < 0.001).

A second experiment was run on more difficult functions: CNF functions of ten, four-literal clauses, with the clauses drawn from a common pool of 20 clauses. The results are shown in Figure 3 (previously-learned function) and Figure 4 (new function). While the percentage improvement with FLIP is not as striking this time, it is significant in both cases.

Figure 3: Error rates (mean +/- standard error) for random CNF functions of ten, four-literal clauses, drawn from a pool of 20 clauses. The test function was the first function learned. According to a directional Spearman rank-order correlation coefficient test, these results are significant (rs = -0.782, p < 0.005).

Figure 4: Error rates (mean +/- standard error) for random CNF functions of ten, four-literal clauses, drawn from a pool of 10 clauses. The test function was a newly-generated function in the same class. According to a directional Spearman rank-order correlation coefficient test, these results are significant (rs = -0.800, p < 0.005).

Related Work

Some other research has been done on constructive induction specifically for decision trees. Two recent systems are FRINGE (Pagallo and Haussler, 1990) and LFC (Ragavan and Rendell, 1993).

FRINGE alternates tree-growing and feature construction. Features are constructed from the last nodes on the path from the root to each leaf in the most recently grown tree. Features thus produced are used in successively grown trees, and may in turn be used to construct even larger features. The algorithm halts when no new features are found, or a limit on the total number of features is reached.

LFC also intersperses tree-growing and feature construction, performing a beam search between each recursive call to the tree-grower. Using clever search techniques which emphasize reducing the entropy of the subset of training data which satisfies the feature, LFC is able to evaluate many fewer features.

Since both of these approaches construct features while growing a tree, they cannot be used to find features useful over several tasks. FLIP solves this problem by staying at the top level rather than recursing into subsets of the data.

Conclusions and Future Work

FLIP has been previously shown to offer significant improvement on single-task learning problems (Drake, 1995). In this paper, evidence has been presented for FLIP's helpfulness in learning sets of related tasks.

One severe limitation of the current FLIP system is that it can only handle Boolean features. Information theoretic techniques can be used to find good decision-tree splits for other types of features (Quinlan, 1993), so it may be possible to construct compound features in a similar manner.

In addition, since FLIP is a preprocessor, there is no real reason to use the features it constructs only in decision trees. Other inductive learning methods, such as neural networks or genetic algorithms, may be able to take advantage of FLIP-constructed features. It would also be interesting to see if such features could help provide more useful `nearness' metrics for nearest-neighbor methods.

Acknowledgements

I would like to thank Prasad Tadepalli, Tom Dietterich, and Mike Muchisky for their suggestions and comments.

Bibliography

Dietterich, T. and Shavlik, J., eds. (1990). Readings in Machine Learning. San Mateo, CA: Morgan Kaufmann.

Drake, P. (1995). Constructive Induction for Improved Learning of Boolean Functions. Master's Thesis, Oregon State University.

Michalski, R.S. (1983). A theory and methodology of inductive learning. In Artificial Intelligence, 20, 111-116. Reprinted in (Dietterich & Shavlik, 1990) above.

Pagallo, G. and Haussler, D. (1990). Boolean Feature Discovery in Empirical Learning. Machine Learning, 5, 71-99.

Quinlan, J. R. (1993). C4.5: Programs for Machine Learning. San Mateo, CA: Morgan Kaufmann.

Ragavan, H., and Rendell, M. (1993). Lookahead feature construction for learning hard concepts. In Proceedings of the Tenth International Conference on Machine Learning, 252-259. San Mateo, CA: Morgan Kaufmann.

Rendell, M., and Ragavan, H. (1993). Improving the design of induction methods by analyzing algorithm functionality and data-based concept complexity. In Proceedings of the Thirteenth International Joint Conference on Artificial Intelligence, 952-958. San Mateo, CA: Morgan Kaufmann.

Rumelhart, D. E., Hinton, G. E., and Williams, J. (1986). Learning Internal Representations by Error Propagation. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, 318-362. Cambridge, MA: The MIT Press.