Tuesday, February 11, 2014

Ruminating on Decision Trees

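Decision trees are tree-shaped models in which each internal node tests an attribute, each branch is one outcome of that test, and each leaf holds a decision or a class label. They can be used for decision making, classification of data, and similar tasks.
The following simple example (from the IBM SPSS Modeler Infocenter site) shows a decision tree for making a car purchase.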

Another example of a decision tree that can be used for classification is shown below. These diagrams are taken from the article available at www.cse.msu.edu/~cse802/DecisionTrees.pdf

A tree with a branching factor of 2, i.e. one in which every internal node has exactly two branches, is called a "binary decision tree". Any tree with larger or mixed branching factors can be represented as an equivalent binary tree, because a test with several outcomes can be replaced by a chain of yes/no tests. For example, the binary tree below evaluates to the same result as the first tree.
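The equivalence between a multiway tree and a binary tree can be sketched in a few lines of Python. This is only an illustration: the `Multiway` and `Binary` classes, the `to_binary` helper and the car attributes are hypothetical names made up for this post, not taken from the diagrams above.

```python
from dataclasses import dataclass
from typing import Dict, Union

Tree = Union[str, "Multiway", "Binary"]  # a leaf is just a label string


@dataclass
class Multiway:
    attribute: str
    branches: Dict[str, Tree]  # attribute value -> subtree or leaf label


@dataclass
class Binary:
    attribute: str
    value: str
    yes: Tree  # taken when record[attribute] == value
    no: Tree   # taken otherwise


def to_binary(tree):
    """Rewrite a multiway tree as a chain of yes/no equality tests."""
    if isinstance(tree, str):
        return tree
    items = list(tree.branches.items())
    # The last branch needs no test of its own: it is what remains
    # when every earlier test has answered "no".
    result = to_binary(items[-1][1])
    for value, subtree in reversed(items[:-1]):
        result = Binary(tree.attribute, value, to_binary(subtree), result)
    return result


def classify(tree, record):
    """Walk either kind of tree until a leaf label is reached."""
    while not isinstance(tree, str):
        if isinstance(tree, Multiway):
            tree = tree.branches[record[tree.attribute]]
        else:
            tree = tree.yes if record[tree.attribute] == tree.value else tree.no
    return tree


# Made-up example tree with a three-way split on colour.
car_tree = Multiway("color", {
    "red": Multiway("type", {"sports": "buy", "suv": "skip"}),
    "blue": "skip",
    "green": "buy",
})
binary_tree = to_binary(car_tree)
```

For every record whose attribute values appear in the original tree, `classify(car_tree, record)` and `classify(binary_tree, record)` return the same label. The binary version is simply deeper.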

It is easy to see that such decision tree models can help us with segmentation. For example, patients can be segmented into high-risk and low-risk categories, or credit applications into high-risk and low-risk credit.
An excellent whitepaper on Decision Trees by SAS is available here.

Decision trees can also be used in predictive modeling - this is known as Decision Tree Learning or Decision Tree Induction. Such tree models are called classification trees when the predicted outcome is a class and regression trees when it is a number; the umbrella term is Classification And Regression Trees (CART).
Essentially, "Decision Tree Learning" is a data mining technique in which a decision tree is constructed from the data itself using statistical algorithms. These algorithms repeatedly look for the best way to split the data set into branch-like segments, so that the records within each segment are as similar as possible with respect to the outcome being predicted.
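To make the idea of "splitting" concrete, here is a minimal sketch of how one split could be chosen. It assumes Gini impurity as the measure of how mixed a segment is, which is one common criterion among several; the function names and the tiny data set are invented for illustration.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 0.0 when all labels in the segment are the same."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(rows, labels, feature):
    """Return (threshold, weighted_gini) for the best 'value <= threshold' split.

    The threshold is None if no split improves on the unsplit data.
    """
    best = (None, gini(labels))
    values = sorted({row[feature] for row in rows})
    # Candidate thresholds are the midpoints between neighbouring values.
    for low, high in zip(values, values[1:]):
        threshold = (low + high) / 2
        left = [y for row, y in zip(rows, labels) if row[feature] <= threshold]
        right = [y for row, y in zip(rows, labels) if row[feature] > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (threshold, score)
    return best


# Invented toy data: five records with a single numeric feature.
rows = [{"age": 5}, {"age": 8}, {"age": 30}, {"age": 40}, {"age": 50}]
labels = ["yes", "yes", "no", "no", "no"]

print(best_split(rows, labels, "age"))  # prints (19.0, 0.0)
```

A full tree learner would apply the same search to every feature, keep the best split, and then repeat the process on each of the two resulting segments until the segments are pure or too small to split further.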
For example, Wikipedia has a good example of a decision tree that was constructed from historical data on Titanic survivors.
Decision Tree constructed through Data Mining of Titanic passengers.
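A learned tree like this is, in the end, just a set of nested conditions. The sketch below writes out the Titanic tree as I read it from the Wikipedia figure (a split on sex, then age, then "sibsp", the number of siblings or spouses aboard). The thresholds are copied from that figure as best I recall and should be checked against it; the function name is my own.

```python
def predict_survival(sex, age, sibsp):
    """Follow the tree from the root to a leaf and return its label."""
    if sex != "male":
        return "survived"   # most female passengers survived
    if age > 9.5:
        return "died"       # most males older than 9.5 years died
    if sibsp > 2.5:
        return "died"       # young boys with 3 or more siblings aboard
    return "survived"       # young boys with fewer than 3 siblings aboard


print(predict_survival("female", 30, 0))  # prints survived
print(predict_survival("male", 40, 1))    # prints died
```

Each leaf is only the majority outcome for the passengers who ended up in that segment, so the prediction is a likelihood and not a certainty.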
Once such a decision tree model has been created, it can be exported as a standard PMML (Predictive Model Markup Language) file. This PMML file can then be loaded into a real-time scoring engine such as JPMML, which evaluates new records against the model.
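To give a feel for what a scoring engine does with such a file, here is a heavily simplified sketch. It walks a hand-written fragment shaped like a PMML TreeModel (nested Node elements, each guarded by a predicate). It is not how JPMML is implemented: the fragment omits the PMML namespace, header and data dictionary, only numeric SimplePredicate comparisons are handled, and the field names and thresholds are illustrative.

```python
import operator
import xml.etree.ElementTree as ET

# Hand-written fragment for illustration; a real PMML file has much more.
PMML_FRAGMENT = """
<TreeModel functionName="classification">
  <Node score="setosa">
    <True/>
    <Node score="setosa">
      <SimplePredicate field="petal_length" operator="lessOrEqual" value="2.45"/>
    </Node>
    <Node score="versicolor">
      <SimplePredicate field="petal_length" operator="greaterThan" value="2.45"/>
      <Node score="versicolor">
        <SimplePredicate field="petal_width" operator="lessOrEqual" value="1.75"/>
      </Node>
      <Node score="virginica">
        <SimplePredicate field="petal_width" operator="greaterThan" value="1.75"/>
      </Node>
    </Node>
  </Node>
</TreeModel>
"""

# Only the numeric comparison operators are supported in this sketch.
OPS = {
    "lessThan": operator.lt,
    "lessOrEqual": operator.le,
    "greaterThan": operator.gt,
    "greaterOrEqual": operator.ge,
}


def matches(node, record):
    """Check whether the node's predicate holds for the record."""
    predicate = node.find("SimplePredicate")
    if predicate is None:
        return node.find("True") is not None
    compare = OPS[predicate.get("operator")]
    return compare(record[predicate.get("field")], float(predicate.get("value")))


def score(node, record):
    """Descend into the first matching child; a node with no match is the answer."""
    for child in node.findall("Node"):
        if matches(child, record):
            return score(child, record)
    return node.get("score")


root = ET.fromstring(PMML_FRAGMENT).find("Node")
print(score(root, {"petal_length": 1.4, "petal_width": 0.2}))  # prints setosa
print(score(root, {"petal_length": 5.0, "petal_width": 2.0}))  # prints virginica
```

The point is that the model travels as plain data: the tool that learned the tree and the engine that scores against it do not have to be the same product.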

There is another open source project called 'Openscoring' that uses JPMML behind the scenes and provides a REST API for scoring data against a deployed model. A simple example (with probability prediction mode) for identifying a flower based on its attributes is illustrated here: https://github.com/jpmml/openscoring
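A scoring call over REST could look roughly like the sketch below. Treat every specific here as an assumption to be checked against the Openscoring README for the version you run: the base URL, the model id "DecisionTreeIris", the "model/<id>" path, the JSON layout with "id" and "arguments", and the field names. The code only builds the request; sending it requires a running server with the model already deployed.

```python
import json
import urllib.request


def build_evaluation_request(base_url, model_id, arguments, record_id="record-001"):
    """Build (but do not send) an HTTP POST that asks the server to score one record."""
    payload = {"id": record_id, "arguments": arguments}
    return urllib.request.Request(
        url=f"{base_url}/model/{model_id}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Assumed local deployment and assumed field names for the iris model.
request = build_evaluation_request(
    "http://localhost:8080/openscoring",
    "DecisionTreeIris",
    {"Sepal.Length": 5.1, "Sepal.Width": 3.5, "Petal.Length": 1.4, "Petal.Width": 0.2},
)

# With a server running, the request could be sent like this:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```

Because the model sits behind an HTTP endpoint, any application that can issue a POST request can use it for scoring, regardless of the language it is written in.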

Decision trees can also be modeled in rule engines. The IBM ILOG BRMS suite (WODM, WebSphere Operational Decision Management) supports the modeling of rules as a decision tree.