夜流歌 · Updated 2020-10-16

Using Text Data in TensorFlow

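In earlier lessons we learned how to do text classification, but in every case we used TensorFlow's built-in APIs to obtain a ready-made Dataset. We never actually loaded a dataset from a text file ourselves.

In real machine learning tasks, TensorFlow cannot provide every dataset we need, and most data has to be loaded by hand. For text, the format we meet most often is the plain txt file, so in this lesson we learn how to use text data stored in a text file.

Using text data breaks down into roughly two steps:

Load the text data with tf.data.TextLineDataset;
Encode the data with an encoder.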

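1. Loading text data with tf.data.TextLineDataset

The most common way to load text data in TensorFlow is the built-in tf.data.TextLineDataset class.

This API makes loading text in TensorFlow simple and fast.

Here we use a txt file hosted in Google's storage bucket as an example. You can run the same test with a txt file of your own.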

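import tensorflow as tf

# Download the sample text file (cached locally after the first call).
txt_path = tf.keras.utils.get_file('derby.txt', origin='https://storage.googleapis.com/download.tensorflow.org/data/illiad/derby.txt')

# One element per line of the file; attach 0 as a placeholder label.
dataset = tf.data.TextLineDataset(txt_path).map(lambda x: (x, 0))
# shuffle() and batch() return new datasets, so keep the result in its own variable.
batched_dataset = dataset.shuffle(1000).batch(32)

print(dataset)
for data in dataset.take(4):
  print(data)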

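Note the following points:

First, we load the txt file with tf.data.TextLineDataset, which turns it into a tf.data.Dataset object with one element per line;
Next, we map every element to a (text, label) pair. A dataset used for training needs labels and our txt file has none, so we use 0 as a temporary label;
Then we call shuffle to randomize the order and batch to group the data into batches of 32. Both methods return a new dataset rather than changing the original one, which is why the output below still shows single, unshuffled lines;
Finally, we look at the first four elements.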
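The same steps can be tried on a file of your own. The following is a minimal sketch under assumptions: the file name `my_lines.txt`, its three lines, and the variable names are hypothetical stand-ins, and the batch size is reduced to 2 so the effect is visible on such a small file.

```python
import os
import tempfile

import tensorflow as tf

# Hypothetical sample file standing in for your own .txt file.
tmp_dir = tempfile.mkdtemp()
my_path = os.path.join(tmp_dir, "my_lines.txt")
with open(my_path, "w", encoding="utf-8") as f:
    f.write("first line\nsecond line\nthird line\n")

# One dataset element per line; attach a placeholder label of 0.
my_dataset = tf.data.TextLineDataset(my_path).map(lambda line: (line, 0))

# shuffle() and batch() return new datasets, so the result must be assigned.
my_batches = my_dataset.shuffle(buffer_size=10).batch(2)

# Three lines with a batch size of 2 give one batch of 2 and one batch of 1.
batch_sizes = [int(texts.shape[0]) for texts, labels in my_batches]
print(batch_sizes)
```

Iterating over `my_dataset` itself still yields one unbatched line at a time, because the shuffled and batched pipeline lives only in `my_batches`.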

This gives the following result:

<MapDataset shapes: ((), ()), types: (tf.string, tf.int32)>
(<tf.Tensor: shape=(), dtype=string, numpy=b"\xef\xbb\xbfOf Peleus' son, Achilles, sing, O Muse,">, <tf.Tensor: shape=(), dtype=int32, numpy=0>)
(<tf.Tensor: shape=(), dtype=string, numpy=b'The vengeance, deep and deadly; whence to Greece'>, <tf.Tensor: shape=(), dtype=int32, numpy=0>)
(<tf.Tensor: shape=(), dtype=string, numpy=b'Unnumbered ills arose; which many a soul'>, <tf.Tensor: shape=(), dtype=int32, numpy=0>)
(<tf.Tensor: shape=(), dtype=string, numpy=b'Of mighty warriors to the viewless shades'>, <tf.Tensor: shape=(), dtype=int32, numpy=0>)

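As we can see, the dataset has been created successfully, but its text is still raw byte strings. It has not been encoded as numbers, so it is clearly not suitable for machine learning as it stands.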

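2. Encoding the data

We can tokenize the text with a tfds.features.text.Tokenizer object from the tensorflow_datasets package, which splits each sentence it receives into tokens. From the resulting vocabulary we can then build an encoder with the tfds.features.text.TokenTextEncoder class. In more recent tensorflow_datasets releases these classes have moved to tfds.deprecated.text.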
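To make the two roles concrete, here is a pure-Python sketch of the idea. It is a simplified stand-in and not the library's implementation: the function names `toy_tokenize` and `toy_encode`, the regular expression, and the id scheme (ids start at 1, 0 is left free for padding, one extra id for unknown words) are assumptions made for illustration.

```python
import re

def toy_tokenize(line):
    # Keep runs of letters and digits; drop punctuation and whitespace.
    return re.findall(r"[A-Za-z0-9]+", line)

toy_lines = ["Of Peleus' son, Achilles, sing, O Muse,", "Of mighty warriors"]

# The vocabulary is the set of distinct tokens, sorted for a stable order.
toy_vocab = sorted({tok for line in toy_lines for tok in toy_tokenize(line)})

# Ids start at 1 so that 0 stays free for padding (an assumption of this sketch).
token_to_id = {tok: i for i, tok in enumerate(toy_vocab, start=1)}
oov_id = len(toy_vocab) + 1  # single id shared by all out-of-vocabulary tokens

def toy_encode(line):
    # Replace every token with its integer id.
    return [token_to_id.get(tok, oov_id) for tok in toy_tokenize(line)]

toy_encoded = [toy_encode(line) for line in toy_lines]
print(toy_encoded)
```

The same word always maps to the same id, so "Of" gets one id in both lines. The library classes applied to the real dataset next do the same job on the whole file.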

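import tensorflow_datasets as tfds

tokenizer = tfds.features.text.Tokenizer()

# Collect every distinct token that appears in the dataset.
vocab = set()
for text, _ in dataset:
  tokens = tokenizer.tokenize(text.numpy())
  vocab.update(tokens)

# Number of distinct tokens in the vocabulary.
print(len(vocab))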

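This prints the size of the vocabulary, that is, the number of distinct tokens in the file.

Next we can encode the data (the mapping approach below follows the official TensorFlow documentation):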

# Define the encoder
encoder = tfds.features.text.TokenTextEncoder(vocab)

def encode(text, label):
  # Runs eagerly inside tf.py_function, so text.numpy() is available here.
  encoded_text = encoder.encode(text.numpy())
  return encoded_text, label

# Wrap the Python function with tf.py_function so it can be used in map
def encode_map_fn(text, label):
  encoded_text, label = tf.py_function(encode, inp=[text, label], Tout=(tf.int32, tf.int32))

  # Set the shapes manually
  encoded_text.set_shape([None])
  label.set_shape([])

  return encoded_text, label

# Encode the dataset
encoded_data_set = dataset.map(encode_map_fn)
print(encoded_data_set)
for data in encoded_data_set.take(4):
  print(data)

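Here we carried out the following steps:

First, we built the encoder with a tfds.features.text.TokenTextEncoder object;
Then we mapped every element of the dataset;
Inside each mapping call we use tf.py_function. Dataset.map traces the mapping function in graph mode, where tensors are symbolic and calling Tensor.numpy() raises an error, so the Python code has to be wrapped in tf.py_function;
Finally, because tf.py_function does not set the shape of its outputs, we set the shapes manually.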
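The tf.py_function pattern described above can be seen in isolation on a smaller task. This is a minimal sketch under assumptions: the word-counting function and the two sample strings are hypothetical and chosen only to keep the example short.

```python
import tensorflow as tf

def count_words(text):
    # Runs eagerly inside tf.py_function, so .numpy() is available here.
    words = text.numpy().decode("utf-8").split()
    return len(words)

def count_words_map_fn(text):
    # Wrap the Python function so that Dataset.map can call it.
    n = tf.py_function(count_words, inp=[text], Tout=tf.int32)
    # py_function does not record the output shape; a word count is a scalar.
    n.set_shape([])
    return n

lines = tf.data.Dataset.from_tensor_slices(["a b c", "d e"])
counts = [int(n.numpy()) for n in lines.map(count_words_map_fn)]
print(counts)
```

Calling `text.numpy()` directly in `count_words_map_fn` would fail for the reason given above, which is why the Python logic lives in a separate function.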

This gives the following output:

<MapDataset shapes: ((None,), ()), types: (tf.int32, tf.int32)>
(<tf.Tensor: shape=(7,), dtype=int32, numpy=array([7755, 4839, 4383, 5722, 4996, 2065, 8059], dtype=int32)>, <tf.Tensor: shape=(), dtype=int32, numpy=0>)
(<tf.Tensor: shape=(8,), dtype=int32, numpy=array([ 855, 5184,  700, 8356, 5931, 5665, 4634, 7127], dtype=int32)>, <tf.Tensor: shape=(), dtype=int32, numpy=0>)
(<tf.Tensor: shape=(7,), dtype=int32, numpy=array([1620, 6817, 5649, 5461, 5505,  209, 3146], dtype=int32)>, <tf.Tensor: shape=(), dtype=int32, numpy=0>)
(<tf.Tensor: shape=(7,), dtype=int32, numpy=array([7755, 1810, 3656, 4634, 4920, 1136, 6789], dtype=int32)>, <tf.Tensor: shape=(), dtype=int32, numpy=0>)

As we can see, the dataset has been encoded successfully and can now be used to train a model.
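Because the encoded lines have different lengths, they usually have to be padded to a common length before they can be grouped into batches for training. A minimal sketch of that step follows. The three sample sequences and the names `toy_samples`, `toy_encoded_ds`, and `padded` are hypothetical, and `output_signature` assumes TensorFlow 2.4 or later.

```python
import tensorflow as tf

# Hypothetical encoded samples of different lengths, each with a label of 0.
toy_samples = [([4, 5, 8], 0), ([1, 7], 0), ([3], 0)]

def gen():
    for ids, label in toy_samples:
        yield ids, label

toy_encoded_ds = tf.data.Dataset.from_generator(
    gen,
    output_signature=(
        tf.TensorSpec(shape=(None,), dtype=tf.int32),
        tf.TensorSpec(shape=(), dtype=tf.int32),
    ),
)

# Pad every sequence in a batch with zeros up to the longest one in that batch.
padded = toy_encoded_ds.padded_batch(2, padded_shapes=([None], []))

first_ids, first_labels = next(iter(padded))
print(first_ids.numpy().tolist())
```

In the first batch the two-token sequence is padded with a trailing 0 to match the three-token one. The same `padded_batch` call can be applied to a dataset of encoded lines and labels such as the one built above.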

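3. Summary

In this lesson we learned how to use text data in TensorFlow. In most learning tasks we have to load text data ourselves: we load it with tf.data.TextLineDataset, and we then tokenize and encode it with tfds.features.text.Tokenizer and tfds.features.text.TokenTextEncoder from tensorflow_datasets.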
