1 Answer

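This should work as expected. Most likely there is a problem in your implementation, so try it on a dummy dataset first. TransformerMixin does not really care whether the input is a numpy array or a pandas.DataFrame; it will work as "expected" either way.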
import pandas as pd
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import make_pipeline


# BaseEstimator supplies get_params/set_params, which pipelines rely on
class CustomTransformer(TransformerMixin, BaseEstimator):
    def __init__(self, some_stuff=None, column_names=None):
        self.some_stuff = some_stuff
        self.column_names = column_names

    def fit(self, X, y=None):
        # nothing to learn, so fitting is a no-op
        return self

    def transform(self, X):
        # do stuff on X, and return a dataframe.
        # This gets messy if the preceding item is a numpy array
        # and not a dataframe, so rebuild the dataframe first.
        if isinstance(X, np.ndarray):
            X = pd.DataFrame(X, columns=self.column_names)
        else:
            X = X.copy()  # avoid adding columns to the caller's dataframe
        X['str_len'] = X['my_str'].apply(str).str.len()
        X['custom_func'] = X['val'].apply(lambda x: 1 if x > 0.5 else -1)
        return X


df = pd.DataFrame({
    'my_str': [111, 2, 3333],
    'val': [0, 1, 1]
})

# mixing this works as expected
my_pipeline = make_pipeline(StandardScaler(), CustomTransformer(column_names=["my_str", "val"]))
my_pipeline.fit_transform(df)

# using this by itself works as well
my_pipeline = make_pipeline(CustomTransformer(column_names=["my_str", "val"]))
my_pipeline.fit_transform(df)
The output is:
In [ ]: my_pipeline = make_pipeline(StandardScaler(), CustomTransformer(column_names=["my_str", "val"]))
   ...: my_pipeline.fit_transform(df)
Out[ ]:
     my_str       val  str_len  custom_func
0 -0.671543 -1.414214       19           -1
1 -0.742084  0.707107       18            1
2  1.413627  0.707107       17            1

In [ ]: my_pipeline = make_pipeline(CustomTransformer(column_names=["my_str", "val"]))
   ...: my_pipeline.fit_transform(df)
Out[ ]:
   my_str  val  str_len  custom_func
0     111    0        3           -1
1       2    1        1            1
2    3333    1        4            1
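The column_names argument is needed because StandardScaler returns a plain numpy array by default, so the column labels are gone by the time the custom transformer sees the data. A minimal check of that behaviour, assuming scikit-learn's default output configuration (no set_output call); the names scaled and restored are illustrative only:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({"my_str": [111, 2, 3333], "val": [0, 1, 1]})

# With default settings the scaler hands back an ndarray, not a dataframe.
scaled = StandardScaler().fit_transform(df)
print(type(scaled))

# The labels have to be re-attached by hand, which is the job
# column_names does inside the custom transformer.
restored = pd.DataFrame(scaled, columns=df.columns, index=df.index)
print(restored.columns.tolist())
```

This is why the same transformer works both as the first step, where it receives the dataframe directly, and after a scaler, where it receives an array.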
Alternatively, if you want to map things directly onto the dataframe, you can use sklearn-pandas:
from sklearn_pandas import DataFrameMapper

# using sklearn-pandas
# cast to str first: the sample my_str values are integers, which have no .str accessor
str_transformer = FunctionTransformer(lambda x: x.apply(lambda y: y.astype(str).str.len()))
cust_transformer = FunctionTransformer(lambda x: (x > 0.5) * 2 - 1)
mapper = DataFrameMapper([
    (['my_str'], str_transformer),
    (['val'], make_pipeline(StandardScaler(), cust_transformer))
], input_df=True, df_out=True)
mapper.fit_transform(df)
Output:
In [ ]: mapper.fit_transform(df)
Out[ ]:
   my_str  val
0       3   -1
1       1    1
2       4    1
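Using sklearn-pandas lets you be more explicit about taking a dataframe as input and returning a dataframe as output. It also lets you map each column separately to whichever pipeline you are interested in, rather than encoding or hard-coding the column names as part of the TransformerMixin-based object.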