
Why does Keras.preprocessing.sequence process characters instead of words?


Python

浮云間 2023-09-12 16:53:38

I am working on transcribing speech to text and ran into an issue (I think) when using pad_sequences in Keras. I pretrained a model that used pad_sequences on a dataframe, and it fit the data into an array with the same number of columns and rows for each value. However, when I use pad_sequences on the transcribed text, the number of characters in the spoken string becomes the number of rows in the returned numpy array.

Say I have a string with 4 characters: it returns a 4 x 500 numpy array. For a string with 6 characters it returns a 6 x 500 numpy array, and so on.

My code, for clarification:

    import speech_recognition as sr
    import pyaudio
    import pandas as pd
    from helperFunctions import *

    jurors = ['Zack', 'Ben']
    storage = []
    storage_df = pd.DataFrame()
    while len(storage) < len(jurors):
        print('Juror' + ' ' + jurors[len(storage)] + ' ' + 'is speaking:')
        init_rec = sr.Recognizer()
        with sr.Microphone() as source:
            audio_data = init_rec.adjust_for_ambient_noise(source)
            audio_data = init_rec.listen(source) #each juror speaks for 10 seconds
        audio_text = init_rec.recognize_google(audio_data)
        print('End of juror' + ' ' + jurors[len(storage)] + ' ' + 'speech')
        storage.append(audio_text)
        cleaned = clean_text(audio_text)
        tokenized = tokenize_text(cleaned)
        padded_text = padding(cleaned, tokenized) #fix padded text elongating rows

I use a helper functions script:

    # imports were not shown in the original post; these module paths are assumed
    import re
    from keras.preprocessing.text import Tokenizer
    from keras.preprocessing.sequence import pad_sequences

    def clean_text(text, stem=False):
        text_clean = r'@\S+|https?:\S|[^A-Za-z0-9]+'
        text = re.sub(text_clean, ' ', str(text).lower()).strip()
        #text = tf.strings.substr(text, 0, 300) #restrict text size to 300 chars
        return text

    def tokenize_text(text):
        tokenizer = Tokenizer()
        tokenizer.fit_on_texts(text)
        return tokenizer

    def padding(text, tokenizer):
        text = pad_sequences(tokenizer.texts_to_sequences(text), maxlen = 500)
        return text

The returned text will be fed into a pre-trained model, and I'm pretty sure that rows of different lengths will cause an issue.


1 Answer

偶然的你


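The Tokenizer methods such as fit_on_texts or texts_to_sequences expect a list of texts/strings as input (as their names suggest, i.e. texts). However, you are passing a single text/string to them, so they iterate over its characters, treating the string as if it were a list!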
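To make the effect concrete, here is a minimal pure-Python sketch. `toy_texts_to_sequences` and `vocab` are hypothetical stand-ins, not the Keras implementation; they only mimic the relevant behavior, namely that a texts-style function produces one output row per item it gets when iterating over its input.

```python
def toy_texts_to_sequences(texts, vocab):
    """Hypothetical stand-in: one output row per item yielded by `texts`."""
    return [[vocab[w] for w in item.split() if w in vocab] for item in texts]


# Assumed toy vocabulary mapping words to integer ids.
vocab = {"not": 1, "guilty": 2}

# A bare string is iterated character by character, so every character
# (including the space) becomes its own "text" and its own row.
rows_from_string = toy_texts_to_sequences("not guilty", vocab)

# Wrapping the string in a list yields a single row for the whole text.
rows_from_list = toy_texts_to_sequences(["not guilty"], vocab)

print(len(rows_from_string))  # number of characters in the string
print(rows_from_list)         # one row of word ids
```

After padding, the first case would give one padded row per character, which matches the 4 x 500 and 6 x 500 shapes described in the question.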

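One way to resolve this is to add a check at the beginning of each function so that a single string is wrapped in a list before it is used. For example: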

    def padding(text, tokenizer):
        # wrap a single string so it is treated as one text, not as characters
        if isinstance(text, str):
            text = [text]
        # the rest would not change...

You should also do this for the tokenize_text function. After this change, your custom functions will work on both a single string and a list of strings.
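Since the same check is needed in more than one function, it could be factored into a small shared helper. The sketch below is one possible way to do that; the name `ensure_text_list` is an assumption for illustration and does not come from the question or from Keras.

```python
def ensure_text_list(text):
    """Return a list of texts: wrap a bare string, copy any other iterable."""
    if isinstance(text, str):
        return [text]
    return list(text)


# A single string becomes a one-element list.
single = ensure_text_list("not guilty")

# A list (or tuple) of strings passes through unchanged in content.
many = ensure_text_list(["not guilty", "guilty"])

print(single)
print(many)
```

Inside the helper script, the result would then be passed on, for example `tokenizer.fit_on_texts(ensure_text_list(text))` in tokenize_text and `tokenizer.texts_to_sequences(ensure_text_list(text))` in padding.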

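As an important side note, if the code in your question belongs to the prediction phase, it contains a fundamental error: you should use the same Tokenizer instance that you used when training the model, so that the word-to-index mapping and the tokenization are identical to the training phase. It does not make sense to create a new Tokenizer instance for each test sample, or for all of them (unless it has the same mapping and configuration as the one used in training).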
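One common way to reuse the training-time tokenizer is to serialize it once after fitting and load it again at prediction time. The sketch below uses Python's standard pickle module; `save_tokenizer`, `load_tokenizer`, and the file name are hypothetical, and a plain dict stands in for the fitted tokenizer so the example runs without Keras installed.

```python
import os
import pickle
import tempfile


def save_tokenizer(tokenizer, path):
    """Write the fitted tokenizer object to disk after training."""
    with open(path, "wb") as f:
        pickle.dump(tokenizer, f)


def load_tokenizer(path):
    """Read the tokenizer back at prediction time instead of fitting a new one."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Stand-in for a fitted tokenizer (assumption: the real object is picklable).
fitted = {"word_index": {"not": 1, "guilty": 2}}

path = os.path.join(tempfile.mkdtemp(), "tokenizer.pkl")
save_tokenizer(fitted, path)
restored = load_tokenizer(path)

print(restored == fitted)
```

Only unpickle files you created yourself or otherwise trust, since loading a pickle can execute arbitrary code.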

Answered 2023-09-12
