2 Answers

Contributed 1998 experience points · received over 6 upvotes
You can do this in a straightforward way with OneHotEncoder() and np.dot():

1. Convert each element of the dataframe to a string.
2. Use the one-hot encoder to transform the dataframe into a one-hot representation over the unique vocabulary of its categorical elements.
3. Take the dot product of that representation with itself to count the co-occurrences.
4. Recreate a dataframe from the co-occurrence matrix, using the one-hot encoder's feature_names as labels.
#assuming this is your dataset
0 1 2 3
0 (-1.774, 1.145] (-3.21, 0.533] (0.0166, 2.007] (2.0, 3.997]
1 (-1.774, 1.145] (-3.21, 0.533] (2.007, 3.993] (2.0, 3.997]
from sklearn.preprocessing import OneHotEncoder
import numpy as np
import pandas as pd

df = df.astype(str)  # turn each element into a string

# get a (sparse) one-hot representation of the dataframe
enc = OneHotEncoder()
data = enc.fit_transform(df.values)

# get the co-occurrence matrix using a dot product
co_occurrence = np.dot(data.T, data)

# get the vocabulary (columns and index) for the co-occurrence matrix;
# get_feature_names() adds an "x0_"-style prefix, stripped here for readability
# (newer scikit-learn versions call this get_feature_names_out())
vocab = [i[3:] for i in enc.get_feature_names()]

# create the co-occurrence dataframe
ddf = pd.DataFrame(co_occurrence.todense(), columns=vocab, index=vocab)
print(ddf)
                 (-1.774, 1.145]  (-3.21, 0.533]  (0.0166, 2.007]  (2.007, 3.993]  (2.0, 3.997]
(-1.774, 1.145]              2.0             2.0              1.0             1.0           2.0
(-3.21, 0.533]               2.0             2.0              1.0             1.0           2.0
(0.0166, 2.007]              1.0             1.0              1.0             0.0           1.0
(2.007, 3.993]               1.0             1.0              0.0             1.0           1.0
(2.0, 3.997]                 2.0             2.0              1.0             1.0           2.0
As you can verify from the output above, this is exactly what the co-occurrence matrix should be.
The advantage of this approach is that it scales: you can encode further data with the transform method of the one-hot encoder object, and most of the processing happens in sparse matrices until the final dataframe-creation step, which keeps it memory-efficient.
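That transform-based scaling can be sketched as follows; this is a minimal, self-contained toy example with made-up string data in place of the interval bins above:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# fit the encoder and count co-occurrences on an initial batch
df = pd.DataFrame([["a", "x"], ["b", "x"]]).astype(str)
enc = OneHotEncoder(handle_unknown="ignore")  # tolerate unseen values later
data = enc.fit_transform(df.values)           # sparse one-hot matrix
co = data.T.dot(data)                         # sparse co-occurrence counts

# later: encode a new batch with transform() -- no refit -- and add its counts
new_data = enc.transform(pd.DataFrame([["a", "x"]]).astype(str).values)
co = co + new_data.T.dot(new_data)

dense = np.asarray(co.todense())  # densify only at the very end
```

Everything up to the final todense() stays sparse, which is what makes the approach memory-efficient on large inputs.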

Contributed 1829 experience points · received over 7 upvotes
Assume your data is in a dataframe df.
You can then run two loops over the rows of the dataframe, plus two loops over the elements of each pair of rows, like this:
from collections import defaultdict

co_occurrence = defaultdict(int)
for index, row in df.iterrows():
    for index2, row2 in df.iloc[index + 1:].iterrows():
        for row_index, feature in enumerate(row):
            for feature2 in row2[row_index + 1:]:
                co_occurrence[feature, feature2] += 1
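A self-contained toy run of that loop (with short strings in place of the interval bins; note it assumes a default integer index, so that df.iloc[index + 1:] lines up with the labels iterrows() yields):

```python
from collections import defaultdict

import pandas as pd

df = pd.DataFrame([["a", "x"], ["b", "x"], ["a", "y"]])

co_occurrence = defaultdict(int)
for index, row in df.iterrows():
    for index2, row2 in df.iloc[index + 1:].iterrows():  # later rows only
        for row_index, feature in enumerate(row):
            for feature2 in row2[row_index + 1:]:        # later columns only
                co_occurrence[feature, feature2] += 1

result = dict(co_occurrence)
# {('a', 'x'): 1, ('a', 'y'): 1, ('b', 'y'): 1}
```

Unlike the symmetric matrix from the first answer, this variant only pairs a feature with features from later rows and later columns, so each pair is counted once and in one direction.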