
How to get the offsets of a matched n-gram in a text


嚕嚕噠 2022-05-24 15:54:34
I want to match a string (an n-gram) in a text and get its offsets:

    string_to_match = "many workers are very underpaid"
    text = "The new york times claimed in a report that many workers are very underpaid in some africans countries."

As a result I would like to get a tuple like ("matched", 44, 75), where 44 is the start and 75 is the end. This is the code I built, but it only works for unigrams.

    def extract_offsets(line, _len=len):
        words = line.split()
        index = line.index
        offsets = []
        append = offsets.append
        running_offset = 0
        for word in words:
            word_offset = index(word, running_offset)
            word_len = _len(word)
            running_offset = word_offset + word_len
            append(("matched", word_offset, running_offset - 1))
        return offsets

    def get_entities(offsets):
        entities = []
        for elm in offsets:
            if elm[0] == "string_to_match":  # here string_to_match is only one word
                entities.append(elm)
        return entities

    offsets = extract_offsets(text)
    entities = get_entities(offsets)  # [("matched", start, end)]

Any hints for making this work for a sequence of strings or n-grams?

1 Answer

鴻蒙傳說


You can use re.finditer() and call the span() method on each match object to get the start and end indices of the matched substring:


import re

def m():
    string_to_match = "many workers are very underpaid"
    text = "The new york times claimed in a report that many workers are very underpaid in some africans countries."
    # re.escape() keeps the phrase literal even if it contains regex metacharacters
    for x in re.finditer(re.escape(string_to_match), text):
        # x.span() returns the (start, end) indices of the match as a tuple; end is exclusive
        print(x.group(0), x.span())
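To get tuples in the ("matched", start, end) form the question asks for, the same idea can be wrapped in a small helper. This is only a sketch: the name find_ngram_offsets, the label parameter, and the choice to accept any run of whitespace between tokens are assumptions made for illustration, not part of the original answer.

```python
import re

def find_ngram_offsets(text, ngrams, label="matched"):
    """Return (label, start, end) tuples for every occurrence of each n-gram.

    `end` is exclusive, following the convention of match.span().
    """
    offsets = []
    for ngram in ngrams:
        tokens = ngram.split()
        if not tokens:
            continue  # skip empty n-grams, which would match at every position
        # Escape each token so it is matched literally, and allow any run of
        # whitespace between tokens (an assumption; use re.escape(ngram)
        # instead if the spacing must match exactly).
        pattern = r"\s+".join(re.escape(tok) for tok in tokens)
        for match in re.finditer(pattern, text):
            offsets.append((label, match.start(), match.end()))
    # Report matches in order of their start position
    return sorted(offsets, key=lambda item: item[1])

text = ("The new york times claimed in a report that many workers "
        "are very underpaid in some africans countries.")
entities = find_ngram_offsets(text, ["many workers are very underpaid"])
print(entities)  # [('matched', 44, 75)]
```

Because the end index is exclusive, text[44:75] gives back the matched phrase directly. If an inclusive end is preferred, as in the question's extract_offsets, subtract 1 from match.end().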



Answered 2022-05-24