
Referencing and tokenizing a single feature column in a multi-feature TensorFlow dataset

Python

12345678_0001 2023-08-15 16:34:33

I am trying to tokenize a single column in a TensorFlow dataset. The approach I have been using works well when there is only one feature column, for example:

text = ["I played it a while but it was alright. The steam was a bit of trouble."
        " The more they move these game to steam the more of a hard time I have"
        " activating and playing a game. But in spite of that it was fun, I "
        "liked it. Now I am looking forward to anno 2205 I really want to "
        "play my way to the moon.",
        "This game is a bit hard to get the hang of, but when you do it's great."]
target = [0, 1]
df = pd.DataFrame({"text": text, "target": target})
training_dataset = (
    tf.data.Dataset.from_tensor_slices((
        tf.cast(df.text.values, tf.string),
        tf.cast(df.target, tf.int32))))
tokenizer = tfds.features.text.Tokenizer()
lowercase = True
vocabulary = Counter()
for text, _ in training_dataset:
    if lowercase:
        text = tf.strings.lower(text)
    tokens = tokenizer.tokenize(text.numpy())
    vocabulary.update(tokens)
vocab_size = 5000
vocabulary, _ = zip(*vocabulary.most_common(vocab_size))
encoder = tfds.features.text.TokenTextEncoder(vocabulary,
                                              lowercase=True,
                                              tokenizer=tokenizer)

However, when I try the same thing on a dataset with several feature columns, say one produced by make_csv_dataset (where each feature column is named), the approach above fails with ValueError: Attempt to convert a value (OrderedDict([]) to a Tensor. Changing tokenizer.tokenize(text.numpy()) to tokenizer.tokenize(text) raises a different error: TypeError: Expected binary or unicode string, got <tf.Tensor 'StringLower:0' shape=(2,) dtype=string>


2 Answers

寶慕林4294392

Contributed 2021 experience points, received 8+ likes

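The error is simply that tokenizer.tokenize expects a single string, while you are passing it a whole batch of strings. This small edit will work: instead of handing the tokenizer a list of strings, I loop over the batch and feed it one string at a time.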

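from collections import Counter

import tensorflow as tf
import tensorflow_datasets as tfds

# Each element of this dataset is a batch of rows: (dict of features, labels).
dataset = tf.data.experimental.make_csv_dataset(
    'test.csv',
    batch_size=2,
    label_name='target',
    num_epochs=1)

# In newer TFDS releases this API lives under tfds.deprecated.text.
tokenizer = tfds.features.text.Tokenizer()
lowercase = True
vocabulary = Counter()

for features, _ in dataset:
    # Pick the single named column to tokenize.
    text = features['text']
    if lowercase:
        text = tf.strings.lower(text)
    # `text` holds a batch of strings, so tokenize them one by one.
    for t in text:
        tokens = tokenizer.tokenize(t.numpy())
        vocabulary.update(tokens)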
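Once the loop has filled the counter, the remaining steps from the question keep the most frequent tokens and map them to integer ids. The sketch below illustrates that step with a plain dictionary standing in for TokenTextEncoder; the token counts, the small `vocab_size`, and the id convention (ids start at 1, one extra id for unknown tokens) are assumptions made for illustration, not the library's behavior.

```python
from collections import Counter

# Hypothetical token counts, standing in for the counter built by the loop.
vocabulary = Counter(["it", "was", "fun", "it", "was", "it"])

# Keep only the most frequent tokens (tiny value chosen for illustration).
vocab_size = 2
top_tokens = [token for token, _ in vocabulary.most_common(vocab_size)]

# Assumed convention: ids start at 1, and one extra id marks unknown tokens.
token_to_id = {token: index + 1 for index, token in enumerate(top_tokens)}
oov_id = len(token_to_id) + 1


def encode(tokens):
    """Map each token to its id, falling back to the out-of-vocabulary id."""
    return [token_to_id.get(token, oov_id) for token in tokens]


encoded = encode(["it", "was", "great"])
print(encoded)
```

Tokens that did not make it into the top `vocab_size` all share the same fallback id, which is why a larger vocabulary usually preserves more of the text.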

2023-08-15

哈士奇WWW

Contributed 1799 experience points, received 6+ likes

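Each element of a dataset created by make_csv_dataset is a batch of rows from the CSV file, not a single row; that is why it takes batch_size as an input argument. The for loop you currently use to process and tokenize the text feature, on the other hand, expects a single input sample (i.e. one row) at a time. As a result, tokenizer.tokenize fails when it is given a batch of strings and raises TypeError: Expected binary or unicode string, got array(...).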
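The batching can be checked directly by inspecting one element of the dataset. The following is a minimal sketch, assuming a small two-row CSV written to a temporary directory; the file name and the row contents are made up for illustration, while the column names mirror the question.

```python
import csv
import os
import tempfile

import tensorflow as tf

# Write a tiny hypothetical CSV so the example is self-contained.
csv_path = os.path.join(tempfile.mkdtemp(), "test.csv")
with open(csv_path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["text", "target"])
    writer.writerow(["I liked it", 0])
    writer.writerow(["Hard to learn but great", 1])

dataset = tf.data.experimental.make_csv_dataset(
    csv_path,
    batch_size=2,
    label_name="target",
    num_epochs=1,
    shuffle=False)

# One element is a (features, labels) pair; every feature holds a whole batch.
features, labels = next(iter(dataset))
text_batch = features["text"]
print(text_batch.shape, text_batch.dtype)
```

The `text` feature is a rank-1 string tensor with one entry per row in the batch, which is exactly what a tokenizer expecting a single string cannot handle.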

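One way to fix this with minimal changes is to first unbatch the dataset, perform all the preprocessing on it, and then batch it again. Fortunately, there is a built-in unbatch method we can use here: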

dataset = tf.data.experimental.make_csv_dataset(
    ...,
    # This change is **IMPORTANT**, otherwise the `for` loop would continue forever!
    num_epochs=1
)

# Unbatch the dataset; this is required even if you have used `batch_size=1` above.
dataset = dataset.unbatch()

#############################################
# Do all the preprocessing on the dataset here...
#############################################

# When preprocessing is finished and you are ready to use your dataset:
#### 1. Batch the dataset (only if needed for or applicable to your specific workflow)
#### 2. Repeat the dataset (only if needed for or applicable to your specific workflow)
# Note: `repeat(-1)` repeats the dataset indefinitely.
dataset = dataset.batch(BATCH_SIZE).repeat(NUM_EPOCHS or -1)
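To make the unbatching approach concrete, here is a minimal end-to-end sketch of the vocabulary-counting step on an unbatched dataset. The CSV contents and file name are hypothetical, and a simple regular expression stands in for the tokenizer used in the question, so the resulting tokens may differ from what that tokenizer would produce.

```python
import csv
import os
import re
import tempfile
from collections import Counter

import tensorflow as tf

# Hypothetical two-row CSV with the same column names as in the question.
csv_path = os.path.join(tempfile.mkdtemp(), "reviews.csv")
with open(csv_path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["text", "target"])
    writer.writerow(["It was fun and I liked it", 0])
    writer.writerow(["Hard at first but it is great", 1])

dataset = tf.data.experimental.make_csv_dataset(
    csv_path,
    batch_size=2,
    label_name="target",
    num_epochs=1,
    shuffle=False)

# After unbatching, each element is a single row again.
dataset = dataset.unbatch()

vocabulary = Counter()
for features, _ in dataset:
    # `features["text"]` is now a scalar string tensor.
    text = tf.strings.lower(features["text"])
    # Stand-in tokenizer (assumption): split on runs of word characters.
    tokens = re.findall(r"\w+", text.numpy().decode("utf-8"))
    vocabulary.update(tokens)

print(vocabulary.most_common(3))
```

Because every element is a single row, the original per-sample loop from the question carries over almost unchanged; only the lookup of the named column is new.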

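Another solution, suggested in @NicolasGervais's answer, is to adapt all of your preprocessing code so that it works on a batch of samples rather than on a single sample at a time.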
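As a rough illustration of batch-wise preprocessing, the sketch below tokenizes an entire batch with TensorFlow string ops instead of looping over individual strings. This is an assumption about how such code might look, not the approach from either answer: it uses tf.strings.split on whitespace after stripping punctuation, which will not match the question's tokenizer exactly, and the example sentences are made up.

```python
from collections import Counter

import tensorflow as tf

# Hypothetical batch, shaped like features["text"] from a batched dataset.
text_batch = tf.constant(["I liked it, I really did.", "It was fun!"])

# Lowercase the whole batch in one call.
lowered = tf.strings.lower(text_batch)

# Replace anything that is not a lowercase letter, digit or space with a space.
cleaned = tf.strings.regex_replace(lowered, "[^a-z0-9 ]+", " ")

# Splitting a batch yields a RaggedTensor: one variable-length row per string.
tokens = tf.strings.split(cleaned)

# `flat_values` holds every token of the batch in a single 1-D tensor.
vocabulary = Counter(
    token.decode("utf-8") for token in tokens.flat_values.numpy())

print(vocabulary.most_common(2))
```

The trade-off is that the tokenization rules now live in TensorFlow ops, which keeps the dataset batched but requires re-expressing any tokenizer-specific behavior by hand.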

2023-08-15
