I want to match a string (an n-gram) in a text and get its offsets:

string_to_match = "many workers are very underpaid"
text = "The new york times claimed in a report that many workers are very underpaid in some africans countries."

As a result, I would like a tuple such as ("matched", 44, 75), where 44 is the start and 75 is the end. Here is the code I have built so far, but it only works for unigrams.

def extract_offsets(line, _len=len):
    words = line.split()
    index = line.index
    offsets = []
    append = offsets.append
    running_offset = 0
    for word in words:
        word_offset = index(word, running_offset)
        word_len = _len(word)
        running_offset = word_offset + word_len
        append(("matched", word_offset, running_offset - 1))
    return offsets

def get_entities(offsets):
    entities = []
    for elm in offsets:
        if elm[0] == "string_to_match":  # here string_to_match is only one word
            entities.append(elm)
    return entities

offsets = extract_offsets(text)
entities = get_entities(offsets)  # [("matched", start, end)]

Any tips on making this work for word sequences or n-grams?
1 Answer

鴻蒙傳說
You can use re.finditer() and call the span() method on each match object to get the start and end indices of the matched substring:
import re

def find_offsets():
    string_to_match = "many workers are very underpaid"
    text = "The new york times claimed in a report that many workers are very underpaid in some africans countries."
    # re.escape() keeps any regex metacharacters in the phrase literal
    for match in re.finditer(re.escape(string_to_match), text):
        # span() returns (start, end) of the matched substring; end is exclusive
        print(match.group(0), match.span())

find_offsets()
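To get the ("matched", start, end) tuples the question asks for, rather than printed output, the same idea can be wrapped in a helper. This is only a minimal sketch: the function name find_ngram_offsets and its label parameter are made up for illustration, and it assumes the words of an n-gram may be separated by any run of whitespace in the text.

```python
import re

def find_ngram_offsets(text, ngrams, label="matched"):
    """Return (label, start, end) tuples for every occurrence of each n-gram.

    Hypothetical helper: the name and the `label` parameter are illustrative.
    `end` is exclusive, as with Match.end().
    """
    offsets = []
    for ngram in ngrams:
        # Escape each word, then allow any whitespace run between words.
        pattern = r"\s+".join(re.escape(word) for word in ngram.split())
        for match in re.finditer(pattern, text):
            offsets.append((label, match.start(), match.end()))
    # Order the results by start position in the text.
    return sorted(offsets, key=lambda item: item[1])

text = ("The new york times claimed in a report that many workers "
        "are very underpaid in some africans countries.")
print(find_ngram_offsets(text, ["many workers are very underpaid"]))
# [('matched', 44, 75)]
```

The end offset is exclusive, so text[44:75] gives back the matched phrase. If you need an inclusive end, as in the extract_offsets() code from the question, subtract 1 from match.end().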