
Using sklearn in Python to compute TF-IDF for variable n-grams

Python

呼如林 2022-06-22 18:13:04

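Problem: use scikit-learn to find the number of hits of variable n-grams from a particular vocabulary.

Explanation: I took the examples from here. Imagine I have a corpus and I want to find how many hits (counts) it has for a vocabulary like the following:

    myvocabulary = [(window=4, words=['tin', 'tan']),
                    (window=3, words=['electrical', 'car']),
                    (window=3, words=['elephant', 'banana'])]

What I call the window here is the length of the span of words within which the words may appear. For example:

'tin tan' is a hit (within 4 words)
'tin dog tan' is a hit (within 4 words)
'tin dog cat tan' is a hit (within 4 words)
'tin car sun eclipse tan' is NOT a hit; tin and tan are more than 4 words apart.

I just want to count how many times (window=4, words=['tin', 'tan']) appears in a text, do the same for all the other entries, and then add the results to a pandas DataFrame in order to compute TF-IDF. I could only find something like this:

    from sklearn.feature_extraction.text import TfidfVectorizer
    tfidf = TfidfVectorizer(vocabulary = myvocabulary, stop_words = 'english')
    tfs = tfidf.fit_transform(corpus.values())

where the vocabulary is a simple list of strings, each being a single word or several words. The following, from the scikit-learn documentation, does not help either:

    class sklearn.feature_extraction.text.CountVectorizer
    ngram_range : tuple (min_n, max_n)
    The lower and upper boundary of the range of n-values for different n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used.

Any ideas? Thanks.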


1 Answer

一只斗牛犬


I am not sure whether this can be done with CountVectorizer or TfidfVectorizer. I wrote my own function for it, as follows:

import string

import numpy as np
import pandas as pd

def contained_within_window(token, word1, word2, threshold):
    # Normalize case and strip punctuation so "document." matches "document".
    word1 = word1.lower()
    word2 = word2.lower()
    token = token.translate(str.maketrans('', '', string.punctuation)).lower()
    # Cheap substring pre-check before splitting the text into words.
    if word1 in token and word2 in token:
        word_list = token.split(" ")
        word1_index = [i for i, x in enumerate(word_list) if x == word1]
        word2_index = [i for i, x in enumerate(word_list) if x == word2]
        # Count every (word1, word2) position pair at most `threshold` positions apart.
        count = 0
        for i in word1_index:
            for j in word2_index:
                if np.abs(i - j) <= threshold:
                    count += 1
        return count
    return 0
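One thing worth noting: the function above compares the difference between word positions with `threshold`, whereas the question defines the window as the length of the word span. Under that reading, a window of 4 corresponds to a position difference of at most 3. The following is a minimal sketch of a span-based variant, assuming that interpretation; the name `count_window_hits` is made up for illustration and is not part of the original answer or of scikit-learn.

```python
import string

def count_window_hits(text, word1, word2, window):
    """Count (word1, word2) position pairs that fit inside a span of `window` words."""
    # Strip punctuation and lowercase, mirroring contained_within_window.
    cleaned = text.translate(str.maketrans('', '', string.punctuation)).lower()
    tokens = cleaned.split()
    pos1 = [i for i, tok in enumerate(tokens) if tok == word1.lower()]
    pos2 = [i for i, tok in enumerate(tokens) if tok == word2.lower()]
    # Words at positions i and j occupy a span of abs(i - j) + 1 words.
    return sum(1 for i in pos1 for j in pos2 if abs(i - j) + 1 <= window)

# The four examples from the question, with window=4.
examples = ['tin tan', 'tin dog tan', 'tin dog cat tan', 'tin car sun eclipse tan']
hits = [count_window_hits(s, 'tin', 'tan', window=4) for s in examples]
print(hits)  # [1, 1, 1, 0]
```

With this definition the first three examples are hits and the last one is not, which matches the behaviour described in the question.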

Sample:

corpus = [
    'This is the first document. And this is what I want',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
    'I like coding in sklearn',
    'This is a very good question'
]

df = pd.DataFrame(corpus, columns=["Test"])

Your df will look like this:

                                                Test
0  This is the first document. And this is what I...
1              This document is the second document.
2                         And this is the third one.
3                        Is this the first document?
4                           I like coding in sklearn
5                       This is a very good question

Now you can apply contained_within_window as follows:

sum(df.Test.apply(lambda x: contained_within_window(x,word1="this", word2="document",threshold=2)))

You get:

2

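You can run a for loop to check the different vocabulary entries, use the resulting counts to build your pandas df, and then apply TF-IDF to it, which is straightforward.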
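As a rough sketch of that loop, the snippet below builds a document-by-entry count table and passes it to scikit-learn's `TfidfTransformer`, which accepts a precomputed count matrix. The `vocabulary` triples, the helper `window_count`, and the `word1~word2` column names are assumptions made for illustration; the counting rule is the same position-difference rule as `contained_within_window`.

```python
import string

import pandas as pd
from sklearn.feature_extraction.text import TfidfTransformer

def window_count(text, word1, word2, threshold):
    # Same rule as contained_within_window: count position pairs whose
    # index difference is at most `threshold`.
    tokens = text.translate(str.maketrans('', '', string.punctuation)).lower().split()
    pos1 = [i for i, t in enumerate(tokens) if t == word1]
    pos2 = [i for i, t in enumerate(tokens) if t == word2]
    return sum(1 for i in pos1 for j in pos2 if abs(i - j) <= threshold)

corpus = [
    'This is the first document. And this is what I want',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
    'I like coding in sklearn',
    'This is a very good question',
]

# Hypothetical vocabulary: (word1, word2, threshold) triples, all lowercase.
vocabulary = [('this', 'document', 2), ('this', 'first', 2), ('this', 'is', 1)]

# One row per document, one column per vocabulary entry.
counts = pd.DataFrame(
    [[window_count(doc, w1, w2, th) for (w1, w2, th) in vocabulary] for doc in corpus],
    columns=[f'{w1}~{w2}' for (w1, w2, _) in vocabulary],
)

# TfidfTransformer turns raw counts into TF-IDF weights (L2-normalized by default).
tfidf = TfidfTransformer().fit_transform(counts.to_numpy())
tfidf_df = pd.DataFrame(tfidf.toarray(), columns=counts.columns)

print(counts)
print(tfidf_df.round(3))
```

Documents with no hits at all (such as 'I like coding in sklearn' here) end up as all-zero rows, so you may want to filter them out before any downstream step that expects nonzero vectors.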

Hope this helps!

2022-06-22
