1 回答

TA貢獻1784條經驗 獲得超2個贊
我不確定這是否可以使用CountVectorizeror來完成TfidfVectorizer。我為此編寫了自己的函數,如下所示:
import pandas as pd
import numpy as np
import string
def contained_within_window(token, word1, word2, threshold):
word1 = word1.lower()
word2 = word2.lower()
token = token.translate(str.maketrans('', '', string.punctuation)).lower()
if (word1 in token) and word2 in (token):
word_list = token.split(" ")
word1_index = [i for i, x in enumerate(word_list) if x == word1]
word2_index = [i for i, x in enumerate(word_list) if x == word2]
count = 0
for i in word1_index:
for j in word2_index:
if np.abs(i-j) <= threshold:
count=count+1
return count
return 0
樣本:
corpus = [
'This is the first document. And this is what I want',
'This document is the second document.',
'And this is the third one.',
'Is this the first document?',
'I like coding in sklearn',
'This is a very good question'
]
df = pd.DataFrame(corpus, columns=["Test"])
你的df會看起來像這樣:
Test
0 This is the first document. And this is what I...
1 This document is the second document.
2 And this is the third one.
3 Is this the first document?
4 I like coding in sklearn
5 This is a very good question
現在你可以申請contained_within_window如下:
sum(df.Test.apply(lambda x: contained_within_window(x,word1="this", word2="document",threshold=2)))
你得到:
2
您可以運行一個for循環來檢查不同的實例。你這個來構建你的 pandasdf并應用TfIdf它,這是直截了當的。
希望這可以幫助!
添加回答
舉報