
How to get the offsets of a matched n-gram in a text


Python

嚕嚕噠 2022-05-24 15:54:34

I want to match a string (an n-gram) in a text and get its offsets:

    string_to_match = "many workers are very underpaid"
    text = "The new york times claimed in a report that many workers are very underpaid in some africans countries."

As a result I would like a tuple such as ("matched", 44, 75), where 44 is the start and 75 is the end.

This is the code I have built, but it only works for unigrams:

    def extract_offsets(line, _len=len):
        words = line.split()
        index = line.index
        offsets = []
        append = offsets.append
        running_offset = 0
        for word in words:
            word_offset = index(word, running_offset)
            word_len = _len(word)
            running_offset = word_offset + word_len
            append(("matched", word_offset, running_offset - 1))
        return offsets

    def get_entities(offsets):
        entities = []
        for elm in offsets:
            if elm[0] == "string_to_match":  # here string_to_match is only one word
                entities.append(elm)
        return entities

    offsets = extract_offsets(text)
    entities = get_entities(offsets)  # [("matched", start, end)]

Any hints on making this work for a sequence of words, i.e. an n-gram, would be appreciated!


1 Answer

鴻蒙傳說


You can use re.finditer() and call the span() method on each match object to get the start and end indices of the matched substring:

    import re

    def m():
        string_to_match = "many workers are very underpaid"
        text = "The new york times claimed in a report that many workers are very underpaid in some africans countries."
        # re.escape() makes any regex metacharacters in the n-gram match literally
        matches = re.finditer(re.escape(string_to_match), text)
        for x in matches:
            # x.span() returns the start and end indices of the matched
            # substring as a tuple; the end index is exclusive
            print(x.group(0), x.span())
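To get exactly the ("matched", start, end) tuples the question asks for, the same re.finditer() idea can be wrapped in a small helper. This is only a sketch: the name find_ngram_offsets and its label parameter are made up for illustration, and the lookarounds that keep a match from starting or ending in the middle of a word are an added assumption, not part of the answer above.

```python
import re


def find_ngram_offsets(text, ngrams, label="matched"):
    """Return (label, start, end) tuples for every occurrence of each n-gram.

    Hypothetical helper: `end` is exclusive, as returned by Match.span().
    """
    offsets = []
    for ngram in ngrams:
        # re.escape() treats the n-gram as literal text; the lookarounds
        # reject matches that begin or end inside a longer word.
        pattern = r"(?<!\w)" + re.escape(ngram) + r"(?!\w)"
        for match in re.finditer(pattern, text):
            start, end = match.span()
            offsets.append((label, start, end))
    # Sort by start offset so results follow the order of the text.
    return sorted(offsets, key=lambda item: item[1])


text = ("The new york times claimed in a report that many workers "
        "are very underpaid in some africans countries.")
offsets = find_ngram_offsets(
    text, ["many workers are very underpaid", "new york times"]
)
print(offsets)  # [('matched', 4, 18), ('matched', 44, 75)]
```

Because the end offset is exclusive, text[start:end] gives back the matched n-gram directly; subtract 1 from end if an inclusive offset is needed, as in the question's own extract_offsets().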

Answered 2022-05-24
