1 Answer

One way I came up with is to build a decision tree with max_depth=1. That performs a single split into two leaves. Then pick the leaf with the highest impurity to split further, fit a decision tree on that subset, and repeat. To make the hierarchy visible, I relabel the leaf_ids so that the ID values decrease as you move up the tree. Here is an example:
```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

def decision_tree_one_path(X, y=None, min_leaf_size=3):
    nobs = X.shape[0]
    # boolean vector marking the observations still in the newest split
    include = np.ones((nobs,), dtype=bool)
    # try to get leaves of roughly min_leaf_size
    min_leaf_size = max(min_leaf_size, 1)
    # one-level decision-tree splitter
    dtmodel = DecisionTreeClassifier(
        splitter="best", criterion="gini", max_depth=1,
        min_samples_split=int(np.round(2.05 * min_leaf_size)))
    leaf_id = np.ones((nobs,), dtype='int64')
    iter = 0
    if y is None:
        y = np.random.binomial(n=1, p=0.5, size=nobs)
    while nobs >= 2 * min_leaf_size:
        dtmodel.fit(X=X.loc[include], y=y[include])
        # assign each included observation its node id (1 or 2)
        new_leaf_names = dtmodel.apply(X=X.loc[include])
        impurities = dtmodel.tree_.impurity[1:]
        if len(impurities) == 0:
            # was not able to split while maintaining the size constraint
            break
        # make sure the node that is NOT split gets the lower label 1
        most_impure_node = np.argmax(impurities)
        if most_impure_node == 0:  # i.e., label 1
            # swap labels 1 and 2
            is_label_2 = new_leaf_names == 2
            new_leaf_names[is_label_2] = 1
            new_leaf_names[np.logical_not(is_label_2)] = 2
        # rename leaves
        leaf_id[include] = iter + new_leaf_names
        will_be_split = new_leaf_names == 2
        # drop the finished terminal leaf from further splitting
        tmp = np.ones((nobs,), dtype=bool)
        tmp[np.logical_not(will_be_split)] = False
        include[include] = tmp
        # continue with the subset that will be split next
        nobs = np.sum(will_be_split)
        iter = iter + 1
    return leaf_id
```
leaf_id is thus each observation's leaf ID, in splitting order. For example, leaf_id==1 marks the observations first split off into a terminal node; leaf_id==2 is the next terminal node, split off from the split that produced leaf_id==1, and so on, as shown below. After k splits there are k+1 leaves.
#0
#|\
#1 .
# |\
# 2 .
#.......
#
# |\
# k (k+1)
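The core mechanism used in each iteration above, reading per-leaf impurities and leaf assignments off a depth-1 tree, can be sketched in isolation. This is a minimal example on synthetic data; the variable names (`stump`, `leaf_impurities`, etc.) are illustrative, not part of the scikit-learn API:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y = (X[:, 0] > 0).astype(int)  # class depends on the first feature only

# a "stump": a tree restricted to a single split
stump = DecisionTreeClassifier(max_depth=1).fit(X, y)

# tree_.impurity stores one Gini value per node;
# node 0 is the root, nodes 1 and 2 are the two leaves
leaf_impurities = stump.tree_.impurity[1:]
print(len(leaf_impurities))  # 2

# apply() maps each sample to the id of its leaf node (1 or 2 here)
leaf_of_sample = stump.apply(X)
print(sorted(set(leaf_of_sample.tolist())))  # [1, 2]

# the leaf the loop above would pick for the next split
most_impure = int(np.argmax(leaf_impurities))
```

Because `apply()` returns node ids 1 and 2 for a depth-1 tree, the relabeling step in the function only has to ensure that label 2 always denotes the leaf that gets split next.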
I would still like to know, though, whether there is a way to automate this in Python.