
Text Corpus Processing with Keras

Tags: machine learning, deep learning, natural language processing

Demo1:

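# Legacy Keras 2.x preprocessing API; in TensorFlow 2.x the same names live under tensorflow.keras.preprocessing.
from keras.preprocessing.text import text_to_word_sequence, one_hot
from keras.preprocessing.sequence import pad_sequences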
s1 = 'hello this is xiaoming! How are you ?'
s2 = 'I am fine thank you and you ?'
s3 = 'I am fine too !'

# Tokenize English text: lowercase it, strip punctuation, and split on whitespace
wordslist1 = text_to_word_sequence(s1)
wordslist2 = text_to_word_sequence(s2)
wordslist3 = text_to_word_sequence(s3)  
print(wordslist1)

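# "One-hot" encoding: despite its name, one_hot() hashes each word to an
# integer in [1, vocab_size), so two different words can share an index.
# The default hash is not stable across Python runs, so the numbers may change.
vocab_size = 10000
oh1 = one_hot(s1, vocab_size)
oh2 = one_hot(s2, vocab_size)
oh3 = one_hot(s3, vocab_size)
for each in [oh1, oh2, oh3]:
    print(each)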

# Pad the sequences with trailing zeros to a fixed length of 16
pad_oh = pad_sequences([oh1, oh2, oh3], maxlen=16, padding='post')
for each in pad_oh:
    print(each)
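
The `one_hot` and `pad_sequences` calls in Demo1 can be hard to picture from their printed output alone. The following is a minimal pure-Python sketch of the two ideas, not Keras's actual implementation. The helper names `simple_tokenize`, `hashed_indices`, and `pad_post` are hypothetical, the md5 hash is chosen here only so that results are repeatable between runs, and the tokenizer strips all ASCII punctuation, which is slightly broader than the Keras default filter.

```python
import hashlib
import string


def simple_tokenize(text):
    """Lowercase, replace punctuation with spaces, then split on whitespace."""
    table = str.maketrans({ch: " " for ch in string.punctuation})
    return text.lower().translate(table).split()


def hashed_indices(text, n):
    """Map each word to an integer in [1, n - 1] with a stable hash (md5).

    Index 0 is never produced, so it stays free for padding.
    Different words may collide on the same index.
    """
    indices = []
    for word in simple_tokenize(text):
        digest = hashlib.md5(word.encode("utf-8")).hexdigest()
        indices.append(int(digest, 16) % (n - 1) + 1)
    return indices


def pad_post(sequences, maxlen, value=0):
    """Append `value` until each sequence has `maxlen` items.

    Longer sequences keep only their last `maxlen` items.
    """
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:] if maxlen > 0 else []
        padded.append(seq + [value] * (maxlen - len(seq)))
    return padded


s1 = 'hello this is xiaoming! How are you ?'
s3 = 'I am fine too !'
ids1 = hashed_indices(s1, 10000)
ids3 = hashed_indices(s3, 10000)
padded = pad_post([ids1, ids3], maxlen=16)
for row in padded:
    print(row)
```

Because the indices come from a hash rather than from a learned vocabulary, no fitting step is needed, at the cost of possible collisions when the index range is small.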

Demo2:

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
# Count word frequencies, build a vocabulary, and vectorize text
s1 = 'Evaluate on the evaluation data'
s2 = 'Convolutional Neural Networks for Sentence Classification models'
s3 = 'Sentence Classifications with Neural Networks'



# Raw corpus
corpus = ('Convolutional Neural Networks Neural Networks have recently '
          'been shown to achieve impressive results '
          'on the practically important task of sentence categorization')

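# num_words limits the vocabulary used when vectorizing. Keras keeps only words
# whose index is below num_words, i.e. the (num_words - 1) most frequent words.
# The argument is optional; without it every word is kept.
most_freq_num = 8
t = Tokenizer(num_words=most_freq_num)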

# Fit on the corpus to collect its word statistics
t.fit_on_texts([corpus])

# Dictionary mapping every word in the corpus to its frequency
word_c = t.word_counts
print("word counts:", word_c)
print("number of unique words:", len(word_c))

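# Assign every word an index, ordered by corpus frequency (most frequent word = 1).
# word_index covers all words, regardless of num_words.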
word_indexs = t.word_index
print("word indexs:",word_indexs)

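# Convert the sentences to a matrix with one row per sentence and num_words columns.
# In 'count' mode, column i holds how often the word with index i occurs in that
# sentence; column 0 is always 0 because index 0 is reserved.
# For example, with num_words = 6, the row [0. 2. 1. 1. 0. 0.] means the words with
# indices 1 to 5 occur 2, 1, 1, 0, and 0 times.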
t_matrix = t.texts_to_matrix([s1,s2,s3],mode = 'count')
print("texts matrix: \n",t_matrix)
print("text matrix shape: ",t_matrix.shape)


# Convert the sentences to lists of word indices. Words that are not in the
# vocabulary, or whose index is num_words or higher, are dropped.
text_sequence = t.texts_to_sequences([s1,s2,s3])
print("texts to sequence: \n",text_sequence) 

# Pad the index lists to a common length (the longest list, since maxlen is not set)
text_pad = pad_sequences(text_sequence,padding = 'post')
print("Padded sentence vectors: \n",text_pad)
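
To make the role of `num_words` concrete, here is a small pure-Python sketch of the indexing and filtering that Demo2 relies on. It is an approximation written for illustration, not the Keras source: `build_word_index`, `to_sequences`, and `count_matrix` are hypothetical helpers, and tokenization is reduced to lowercasing and whitespace splitting, which is enough for these punctuation-free sentences.

```python
from collections import Counter


def build_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter(w for t in texts for w in t.lower().split())
    # sorted() is stable, so words with equal counts keep first-seen order.
    ranked = sorted(counts, key=lambda w: -counts[w])
    return {w: i for i, w in enumerate(ranked, start=1)}


def to_sequences(texts, word_index, num_words=None):
    """Replace words by indices, dropping unknown words and indices >= num_words."""
    sequences = []
    for t in texts:
        seq = []
        for w in t.lower().split():
            i = word_index.get(w)
            if i is not None and (num_words is None or i < num_words):
                seq.append(i)
        sequences.append(seq)
    return sequences


def count_matrix(sequences, num_words):
    """One row per sequence; column i counts index i (column 0 stays 0)."""
    rows = []
    for seq in sequences:
        row = [0] * num_words
        for i in seq:
            row[i] += 1
        rows.append(row)
    return rows


corpus = ('Convolutional Neural Networks Neural Networks have recently '
          'been shown to achieve impressive results '
          'on the practically important task of sentence categorization')
sentences = ['Evaluate on the evaluation data',
             'Convolutional Neural Networks for Sentence Classification models',
             'Sentence Classifications with Neural Networks']

word_index = build_word_index([corpus])
sequences = to_sequences(sentences, word_index, num_words=8)
print(sequences)  # [[], [3, 1, 2], [1, 2]]
print(count_matrix(sequences, 8))
```

In this sketch the first sentence becomes an empty list: "on" and "the" do occur in the corpus, but they rank below the seven most frequent words, so they are filtered out along with the unknown words. Sequences of unequal length like these are why padding is needed before stacking them into one array.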

Reference tutorial:
https://beyondguo.github.io/2019-03-18-Keras-Text-Preprocessing/

Author: Coder_zheng, algorithm engineer
