
What is the Python code to read only whole words from a text file (lexical analysis that detects whole words only)?

Python

臨摹微笑 2023-10-26 15:51:22

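I want to match only groups of text that form whole words in the everyday sense (a group of text delimited by spaces counts as a word). For example, when I search a text file for the word is, the is inside sister is detected even though the file does not contain the word is. I know a little about lexical analysis, but I could not apply it to my project. Can someone provide Python code for this case? This is the code I am using, and it causes the problem described above.

words_to_find = ("test1", "test2", "test3")
line = 0
#User_Input.txt is a file saved in my computer which i used as the input of the system
with open("User_Input.txt", "r") as f:
    txt = f.readline()
    line += 1
    for word in words_to_find:
        if word in txt:
            print(F"Word: '{word}' found at line {line}, "
                  F"pos: {txt.index(word)}")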


5 Answers

FFIVE

Contributed 1797 experience points, received more than 6 likes

You should use spacy to tokenize your input, because natural language tends to be tricky, with all its exceptions and special cases:

from spacy.lang.en import English

# Blank English pipeline; its tokenizer uses the default settings for English,
# including punctuation rules and exceptions
nlp = English()

with open("User_Input.txt", "r") as f:
    for line, txt_line in enumerate(f, start=1):
        # token.idx is the character offset of the token within the line
        for token in nlp(txt_line.rstrip("\n")):
            print(f'Word {token.text} found at line {line}; pos: {token.idx}')

Alternatively, you can use textblob like this:

from textblob import TextBlob

with open("User_Input.txt", "r") as f:
    for line, txt_line in enumerate(f, start=1):
        # TextBlob expects a string, so each line is tokenized separately
        blob = TextBlob(txt_line)
        for index, word in enumerate(blob.words):
            print(f'Word {word} found in position {index} at line {line}')

Answered 2023-10-26

嚕嚕噠

Contributed 1784 experience points, received more than 7 likes

Use nltk to tokenize your text reliably. Also, remember that words in the text may be in mixed case. Convert them to lowercase before searching.

import nltk
words = nltk.word_tokenize(txt.lower())
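The lowercasing advice applies whichever tokenizer is used. As a minimal sketch, assuming plain whitespace splitting stands in for nltk's tokenizer (so punctuation stays attached to tokens), and with the function name `find_words_ignoring_case` being purely illustrative:

```python
def find_words_ignoring_case(text, words_to_find):
    """Return the whitespace-separated tokens of `text` that equal one of
    the target words when case is ignored."""
    # casefold() is a more aggressive lower() intended for caseless comparison
    targets = {w.casefold() for w in words_to_find}
    return [tok for tok in text.split() if tok.casefold() in targets]


matches = find_words_ignoring_case("Test1 is not TEST2 or attest1", ("test1", "test2"))
print(matches)  # ['Test1', 'TEST2']
```

Because whole tokens are compared rather than substrings, attest1 is not reported as a match for test1.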

Answered 2023-10-26

狐的傳說

Contributed 1804 experience points, received more than 3 likes

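Regular expressions in general, and the \b term in particular, which means "word boundary", are how I separate words from other arbitrary characters. Here is an example: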

import re

# words with arbitrary characters in between
data = """now is; the time for, all-good-men
to come\t to the, aid of
their... country"""

exp = re.compile(r"\b\w+")

pos = 0
while True:
    m = exp.search(data, pos)
    if not m:
        break
    print(m.group(0))
    pos = m.end(0)

Result:

now
is
the
time
for
all
good
men
to
come
to
the
aid
of
their
country
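To apply the same idea to the search in the question, the target words can be wrapped in \b on both sides. The following is a sketch, not the answerer's code: the helper name `find_whole_words` and the sample lines are made up for illustration, and it assumes every search word starts and ends with a word character (letter, digit, or underscore), since \b is defined relative to those.

```python
import re


def find_whole_words(lines, words_to_find):
    """Yield (word, line_number, position) for each whole-word occurrence."""
    # re.escape guards against regex metacharacters in the search words;
    # \b on both sides keeps "is" from matching inside "sister".
    alternatives = "|".join(map(re.escape, words_to_find))
    pattern = re.compile(r"\b(?:" + alternatives + r")\b")
    for line_number, text in enumerate(lines, start=1):
        for m in pattern.finditer(text):
            yield m.group(0), line_number, m.start()


sample = ["my sister is here", "this is it"]
for word, line_number, pos in find_whole_words(sample, ("is",)):
    print(f"Word: '{word}' found at line {line_number}, pos: {pos}")
# Word: 'is' found at line 1, pos: 10
# Word: 'is' found at line 2, pos: 5
```

An open file object can be passed as `lines` directly, since iterating over a file yields one line at a time.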

Answered 2023-10-26

倚天杖

Contributed 1828 experience points, received more than 3 likes

You can use a regular expression:

import re

words_to_find = ["test1", "test2", "test3"]  # converted this to a list to use `in`
line = 0

with open("User_Input.txt", "r") as f:
    txt = f.readline()
    line += 1
    rx = re.findall(r'(\w+)', txt)  # rx will be a list containing all the words in `txt`

    # you can iterate over every word in a line
    for word in rx:  # for every word in the RegEx list
        if word in words_to_find:
            print(word)

    # or you can iterate through your search terms only
    # note that this reports each word in `words_to_find` at most once
    for word in words_to_find:  # `test1`, `test2`, `test3`...
        if word in rx:  # if `test1` is present in this line's list of words...
            print(word)

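What the code above does is apply the (\w+) regular expression to your text string and return a list of matches. Here the regex matches every run of word characters (letters, digits, and underscores), so the text is split at punctuation as well as at whitespace.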
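A short illustration of what \w+ does and does not treat as part of a word may help here. The sample string is invented for the demonstration:

```python
import re

text = "test1, test2; my_test3 isn't test3."
words = re.findall(r"\w+", text)
print(words)
# ['test1', 'test2', 'my_test3', 'isn', 't', 'test3']

# A set makes the membership check independent of the number of targets.
targets = {"test1", "test2", "test3"}
found = [w for w in words if w in targets]
print(found)
# ['test1', 'test2', 'test3']
```

Note that the underscore keeps my_test3 together as one word, while the apostrophe splits isn't into two pieces; whether that is acceptable depends on the text being searched.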

Answered 2023-10-26

慕容森

Contributed 1853 experience points, received more than 18 likes

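If you are trying to find the words test1, test2, or test3 in a text file, you do not need to increment the line value manually. Assuming each word in the text file is on its own line, the following code works: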

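words_to_find = ("test1", "test2", "test3")

with open("User_Input.txt", "r") as f:
    file = f.readlines()

# enumerate supplies the line number, which stays correct even when
# the same line appears more than once in the file
for line_number, line in enumerate(file, start=1):
    txt = line.strip('\n')
    for word in words_to_find:
        if word in txt:
            print(F"Word: '{word}' found at line {line_number}, "
                  F"pos: {txt.index(word)}")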

I am not sure what pos is supposed to mean.
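In the question's code, pos is the value returned by txt.index(word), which is the zero-based character offset of the match within the line. A sketch that reports the same offset while following the question's definition of a word (a group of characters delimited by whitespace) could look like this; the helper name `find_tokens` and the sample lines are assumptions for illustration:

```python
import re


def find_tokens(lines, words_to_find):
    """Return (word, line_number, pos) for every whitespace-delimited token
    that is exactly equal to one of the target words."""
    targets = set(words_to_find)
    hits = []
    for line_number, text in enumerate(lines, start=1):
        # \S+ matches one run of non-whitespace characters, i.e. one token
        for m in re.finditer(r"\S+", text):
            if m.group(0) in targets:
                hits.append((m.group(0), line_number, m.start()))
    return hits


hits = find_tokens(["a test1 here\n", "attest1 test1\n"], ("test1",))
print(hits)  # [('test1', 1, 2), ('test1', 2, 8)]
```

With this definition, punctuation attached to a word (for example test1,) makes the token differ from the target, so the regex or tokenizer approaches in the other answers are a better fit for punctuated text.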

Answered 2023-10-26
