
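How do I create a preprocessing pipeline that includes built-in scikit-learn transformers and custom transformers, one of which does feature engineering?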

Python

泛舟湖上清波郎朗 2023-06-20 10:49:09

Summary

I'm struggling to create a preprocessing pipeline that combines built-in transformers with custom transformers, including one that adds extra attributes to the data and then applies further transformations to the added attributes.

Examples of the additional attributes: there is a phValue attribute with missing data. I'd like to try creating an additional attribute that labels phValue as (Acid, Neutral, Base) in a phLabel column. Another is the string length of each sequence feature. This requires imputing the missing phValue values first, then creating the additional attributes, followed by further transformers that also transform the sequence_length attribute.

My terrible transformer

Here is an example of how I create a custom transformer. I can use it for manual preprocessing, but it isn't the right way to handle this when building a full preprocessing pipeline.

    def data_to_frame(X):
        if isinstance(X, pd.DataFrame):
            return X
        elif isinstance(X, sparse.csr_matrix):
            return pd.DataFrame(X, indices, atributes)
        elif isinstance(X, np.ndarray):
            return pd.DataFrame(X, indices, atributes)
        else:
            raise Exception("Incorrect Data Structure Passed")

    class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
        def __init__(self, no_difference = True): # no *args or **kargs
            self.no_difference = no_difference
        def fit(self, X, y=None):
            return self # nothing else to do
        def transform(self, X):
            atributes.extend(['sequence_length', 'difference', 'phLabel'])
            sequence_length = X.sequence.str.len()
            difference = X['residueCount'] - sequence_length
            phLabel = X['phValue'].apply(ph_labels)
            if self.no_difference:
                atributes.append('no_difference')
                no_difference = (difference == 0)
                return np.c_[X, sequence_length, difference, phLabel, no_difference]
            else:
                return np.c_[X, sequence_length, difference, phLabel]

Pandas operations in transformers

The operations I want to perform inside the transformers are pandas-specific. My solution was to convert the incoming numpy array to a DataFrame and return it as a numpy array from the transform function. I use global variables for the attributes and the indices. I realize this is a lackluster approach. How can I use pandas operations in my custom transformers?


1 Answer

尚方寶劍之說


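This should work as expected. Most likely there is a problem in your implementation, so try working through it on a dummy dataset. TransformerMixin doesn't really care whether the input is a numpy array or a pandas.DataFrame; it will work as "expected".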

    import numpy as np
    import pandas as pd
    from sklearn.base import BaseEstimator, TransformerMixin
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import FunctionTransformer, StandardScaler


    # BaseEstimator is added so the class also gets get_params/set_params
    class CustomTransformer(TransformerMixin, BaseEstimator):
        def __init__(self, some_stuff=None, column_names=None):
            # None instead of [] avoids a shared mutable default argument
            self.some_stuff = some_stuff
            self.column_names = column_names

        def fit(self, X, y=None):
            return self

        def transform(self, X):
            # do stuff on X, and return a dataframe with the same rows;
            # this gets messy if the preceding item returns a numpy array
            # and not a dataframe
            if isinstance(X, np.ndarray):
                X = pd.DataFrame(X, columns=self.column_names)
            else:
                X = X.copy()  # leave the caller's dataframe untouched
            X['str_len'] = X['my_str'].apply(str).str.len()
            X['custom_func'] = X['val'].apply(lambda x: 1 if x > 0.5 else -1)
            return X


    df = pd.DataFrame({
        'my_str': [111, 2, 3333],
        'val': [0, 1, 1]
    })

    # mixing this works as expected
    my_pipeline = make_pipeline(StandardScaler(), CustomTransformer(column_names=["my_str", "val"]))
    my_pipeline.fit_transform(df)

    # using this by itself works as well
    my_pipeline = make_pipeline(CustomTransformer(column_names=["my_str", "val"]))
    my_pipeline.fit_transform(df)

The output is:

    In [ ]: my_pipeline = make_pipeline(StandardScaler(), CustomTransformer(column_names=["my_str", "val"]))
       ...: my_pipeline.fit_transform(df)
    Out[ ]:
         my_str       val  str_len  custom_func
    0 -0.671543 -1.414214       19           -1
    1 -0.742084  0.707107       18            1
    2  1.413627  0.707107       17            1

    In [ ]: my_pipeline = make_pipeline(CustomTransformer(column_names=["my_str", "val"]))
       ...: my_pipeline.fit_transform(df)
    Out[ ]:
       my_str  val  str_len  custom_func
    0     111    0        3           -1
    1       2    1        1            1
    2    3333    1        4            1

Alternatively, if you want to map things directly onto a data frame, you can use sklearn-pandas:

    from sklearn_pandas import DataFrameMapper

    # using sklearn-pandas
    # astype(str) is needed because my_str holds integers, which have no .str accessor
    str_transformer = FunctionTransformer(lambda x: x.apply(lambda y: y.astype(str).str.len()))
    cust_transformer = FunctionTransformer(lambda x: (x > 0.5) * 2 - 1)

    mapper = DataFrameMapper([
        (['my_str'], str_transformer),
        (['val'], make_pipeline(StandardScaler(), cust_transformer))
    ], input_df=True, df_out=True)

    mapper.fit_transform(df)

Output:

    In [ ]: mapper.fit_transform(df)
    Out[47]:
       my_str  val
    0       3   -1
    1       1    1
    2       4    1

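Using sklearn-pandas lets you be more explicit about taking a data frame as input and returning a data frame as output. It also lets you map each column separately to whichever pipeline you are interested in, rather than encoding/hard-coding the column names as part of the TransformerMixin object.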
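Applied to the columns named in the question, the same idea might look like the sketch below. It is not the asker's code: `PhImputer`, `AttributesAdder`, the sample rows, and the pH cut-offs in `ph_label` are all assumptions (the question never shows `ph_labels`), and median imputation is just one possible choice. The last step uses scikit-learn's own `ColumnTransformer` to transform the newly added attributes further.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def ph_label(ph):
    # Assumed cut-offs; the question does not define ph_labels.
    if ph < 7:
        return "Acid"
    if ph > 7:
        return "Base"
    return "Neutral"


class PhImputer(TransformerMixin, BaseEstimator):
    """Hypothetical step: fill missing phValue with the training median."""

    def fit(self, X, y=None):
        self.median_ = X["phValue"].median()
        return self

    def transform(self, X):
        X = X.copy()
        X["phValue"] = X["phValue"].fillna(self.median_)
        return X


class AttributesAdder(TransformerMixin, BaseEstimator):
    """Hypothetical step: add the derived columns using plain pandas."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = X.copy()
        X["sequence_length"] = X["sequence"].str.len()
        X["difference"] = X["residueCount"] - X["sequence_length"]
        X["phLabel"] = X["phValue"].apply(ph_label)
        return X


# Made-up sample rows, for illustration only.
df = pd.DataFrame({
    "sequence": ["MKT", "MKTAY", "MK", "MKTAYIAK"],
    "residueCount": [3, 5, 4, 8],
    "phValue": [6.0, np.nan, 7.0, 8.0],
})

# Impute first, then derive the new attributes; both steps return DataFrames.
engineer = make_pipeline(PhImputer(), AttributesAdder())
engineered = engineer.fit_transform(df)

# Further transformations, selected by column name.
encode = ColumnTransformer([
    ("num", StandardScaler(), ["phValue", "sequence_length", "difference"]),
    ("cat", OneHotEncoder(), ["phLabel"]),
])
full = make_pipeline(engineer, encode)
result = full.fit_transform(df)
print(engineered[["sequence_length", "difference", "phLabel"]])
print(result.shape)
```

Because every custom step receives and returns a DataFrame, no global attribute or index lists are needed, and the column names stay available for the final column-wise step.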

Answered 2023-06-20
