PONIRA TECHNOLOGIES: Classification Using Streaming Random Forests

Classification Using Streaming Random Forests

Abstract:

We consider the problem of data stream classification, where the data arrive in a conceptually infinite stream, and the opportunity to examine each record is brief. We introduce a stream classification algorithm that is online, running in amortized Oð1Þ time, able to handle intermittent arrival of labeled records, and able to adjust its parameters to respond to changing class boundaries (“concept drift”) in the data stream. In addition, when blocks of labeled data are short, the algorithm is able to judge internally whether the quality of models updated from them is good enough for deployment on unlabeled records, or whether further labeled records are required. Unlike most proposed stream-classification algorithms, multiple target classes can be handled. Experimental results on real and synthetic data show that accuracy is comparable to a conventional classification algorithm that sees all of the data at once and is able to make multiple passes over it.

Existing System:

incremental classification algorithm which uses a multi-resolution data representation to find adaptive nearest neighbors of a test point.

Proposed system:

a stream classification algorithm that is online, running in amortized time, able to handle intermittent arrival of labeled records, and able to adjust its parameters to respond to changing class boundaries (“concept drift”) in the data stream.

The Standard Random Forests Algorithm:

The Random Forests algorithm is an ensemble classification technique developed by Breiman As with any tree ensemble classifier, it grows a number of binary decision trees and predicts the class of each new record using the plurality of the class predictions from the set of trees. However, it differs from standard ensemble techniques in the way in which records are selected to grow each tree, and the way in which attributes are selected at each internal node. Suppose that a data set contains n records, each with m attributes. Each tree is grown by:

· Choosing a subset of the n records, at random with replacement, to form a training set.

· For each internal node, a subset of M (M _ m) randomly chosen attributes are selected, and the decision about which attribute and split point is made using the standard Gini index algorithm on only the selected attributes. A typical value of M (suggested by Breiman) is log2 m þ 1.

· The tree is left unpruned.The random selection of records with replacement leaves some records (about a third of them) that are never used in the building of this tree. These records can be used to estimate the error rate of each tree, even as it is being built. This process is repeated with a fresh subset of the records to produce a user-specified number of trees. The choice of attribute and split point at each internal node of each tree is made in a doubly contextualized way—it depends on the other records that were chosen to build this tree, and on the other attributes that were chosen to build this internal node. This property makes overlearning impossible and the predictions of the ensemble extremely robust.

PONIRA TECHNOLOGIES

Wednesday, October 19, 2011

Classification Using Streaming Random Forests

1 comment: