%pylab inline
pylab.rcParams['figure.figsize'] = (18, 8)

from sklearn import tree
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error
from math import sqrt

# Fit a regression tree of limited depth on the training split
# (X_train, y_train, X_test, y_test and df are assumed to be defined earlier).
reg = DecisionTreeRegressor(max_depth=5)
reg.fit(X_train, y_train)

# Evaluate on the held-out set
y_test_pred = reg.predict(X_test)
test_error = mean_squared_error(y_test, y_test_pred)
print(" - Validation RMSE:", sqrt(test_error))
print(" - Validation MAE:", mean_absolute_error(y_test, y_test_pred))

# Visualize the fitted tree
tree.plot_tree(reg, filled=True, feature_names=list(df.columns))
plt.show()
Decision trees, which can be used for both regression and classification, predict the value of a target variable from a set of discrete or continuous explanatory variables. They are powerful non-parametric, non-linear methods that consist in:
This hierarchy makes it possible to visualize the results as a tree and to build explicit predictive rules.
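As a concrete illustration of such rules, the tree fitted in the code above can be printed as nested if/else conditions with scikit-learn's export_text (a minimal sketch, reusing the reg and df objects defined earlier):

from sklearn.tree import export_text

# Print the fitted tree as an explicit set of if/else rules, one branch per line
# (reg and df come from the fitting code above).
print(export_text(reg, feature_names=list(df.columns)))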
Several iterations are necessary. At each iteration:
Each leaf is characterized by a specific path through the tree, called a rule. The set of rules over all the leaves constitutes the model. A rule is easy to interpret when the leaves are pure; otherwise, one must rely on the empirical distribution of the target variable at each node of the tree.
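For instance, the empirical distribution of the target in each leaf of the tree fitted above can be inspected as follows (a minimal sketch; reg, X_train and y_train are the objects from the code above):

import numpy as np
import pandas as pd

# apply() returns, for each training sample, the index of the leaf it ends up in.
leaf_ids = reg.apply(X_train)

# Empirical distribution of the target within each leaf: size, mean and spread.
leaf_summary = (pd.DataFrame({"leaf": leaf_ids, "y": np.asarray(y_train).ravel()})
                .groupby("leaf")["y"]
                .agg(["count", "mean", "std", "min", "max"]))
print(leaf_summary)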
A decision tree can very quickly overfit the training data, so it is necessary to prune it: stop at an appropriate number of leaves when building the tree.
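One common way to do this with scikit-learn is minimal cost-complexity pruning, where a penalty ccp_alpha on tree size controls how aggressively the tree is pruned. The sketch below reuses the train/test splits from above and, for simplicity, selects alpha on the test set (cross-validation would be preferable in practice):

from sklearn.tree import DecisionTreeRegressor

# Candidate pruning strengths computed from the training data.
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X_train, y_train)

best_alpha, best_score = 0.0, float("-inf")
for alpha in path.ccp_alphas:
    # Larger ccp_alpha => stronger pruning => fewer leaves.
    pruned = DecisionTreeRegressor(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    score = pruned.score(X_test, y_test)  # R^2 on the held-out set
    if score > best_score:
        best_alpha, best_score = alpha, score

print("Best ccp_alpha:", best_alpha, "- held-out R^2:", best_score)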
To build the tree, various questions arise:
Three main algorithms exist to build such decision trees by answering the above questions: CART, C4.5 and CHAID. They proceed as follows:
In practice, decision trees are mostly used as weak learners within ensemble methods, such as random forests or gradient boosting.
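A minimal sketch of this idea, reusing the splits from above: the same kind of shallow trees become much stronger predictors once combined in a random forest or a gradient-boosting ensemble.

from math import sqrt
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

# Each ensemble aggregates many individually weak trees into one strong predictor.
for model in (RandomForestRegressor(n_estimators=200, random_state=0),
              GradientBoostingRegressor(max_depth=3, random_state=0)):
    model.fit(X_train, y_train)
    rmse = sqrt(mean_squared_error(y_test, model.predict(X_test)))
    print(type(model).__name__, "- Validation RMSE:", rmse)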