首頁猿問在單詞列表中查找復合詞 - Python

在單詞列表中查找復合詞 - Python

Python

肥皂起泡泡 2024-01-16 15:15:37

我有一個需要過濾的簡單單詞列表，但列表中的每個單詞都附加了一個附帶的“分數”，這給我帶來了一些麻煩。輸入列表具有以下結構：lst = ['FAST;5','BREAK;60','FASTBREAK;40', 'OUTBREAK;110','BREAKFASTBUFFET;35', 'BUFFET;75','FASTBREAKPOINTS;60' ]我試圖弄清楚如何識別列表中僅由同一列表中的其他單詞組成的單詞。例如，應用于lst上面的代碼將產生：ans = ['FASTBREAK:40','BREAKFASTBUFFET;35']我發現一個先前的問題涉及幾乎相同的情況，但在這種情況下，列表中的單詞沒有跟蹤分數，并且我在處理列表中的這些跟蹤分數時遇到了麻煩。該ans列表必須保留找到的復合詞的分數。中的單詞順序lst是隨機且無關的。理想情況下，我希望 ans 列表按單詞的長度（在之前' ; '）排序，如上所示。這將為我節省一些額外的 ans 后處理。我已經找到了一種使用 ReGex 和嵌套 for 循環的方法（我會讓你免去我 1980 年代風格的暴力代碼的丑陋，它真的不漂亮），但我的單詞列表有接近一百萬個條目，我的解決方案需要很長時間才能完全無法使用。我正在尋找一個我可以實際使用的更具 Python 風格的解決方案。我很難解決這個問題。

查看完整描述

3 回答

泛舟湖上清波郎朗

TA貢獻1818條經驗獲得超3個贊

有多種方法可以加快該過程，但我懷疑是否存在多項式解決方案。

因此，讓我們使用多重處理，并盡我們所能來生成有意義的結果。下面的示例與您所要求的并不相同，但它確實從一本大詞典中組成了一個明顯復合詞的列表。

對于下面的代碼，我正在獲取https://gist.github.com/h3xx/1976236，其中按英語頻率順序列出了大約 80,000 個唯一單詞。

如果預先按字母順序對輸入單詞列表進行排序，則可以輕松加快下面的代碼，因為復合詞的每個頭將立即跟隨其潛在的復合成員：

black

blackberries

blackberry

blackbird

blackbirds

blackboard

blackguard

blackguards

blackmail

blackness

blacksmith

blacksmiths

正如評論中提到的，您可能還需要使用語義過濾器來識別真正的復合詞 - 例如，“一般”一詞不是“基因集會”的復合詞！因此，雖然您可能會獲得競爭者列表，但您需要以某種方式消除誤報。

# python 3.9

import multiprocessing as mp

# returns an ordered list of lowercase words to be used.

def load(name) -> list:

return [line[:-1].lower() for line in open(name) if not line.startswith('#') and len(line) > 3]

# function that identifies the compounds of a word from a list.

# ... can be optimised if using a sorted list.

def compounds_of(word: str, values: list):

return [w for w in values if w.startswith(word) and w.removeprefix(word) in values]

# apply compound finding across an mp environment

# but this is the slowest part

def compose(values: list) -> dict:

with mp.Pool() as pool:

result = {(word, i): pool.apply(compounds_of, (word, values)) for i, word in enumerate(values)}

return result

if __name__ == '__main__':

# https://gist.github.com/h3xx/1976236

words = load('wiki-100k.txt') # words are ordered by popularity, and are 3 or more letters, in lowercase.

words = list(dict.fromkeys(words))

# remove those word heads which have less than 3 tails

compounds = {k: v for k, v in compose(words).items() if len(v) > 3}

# get the top 500 keys

rank = list(sorted(compounds.keys(), key=lambda x: x[1]))[:500]

# compose them into a dict and print

tops = {k[0]: compounds[k] for k in rank}

print(tops)

反對回復 2024-01-16

鳳凰求蠱

TA貢獻1825條經驗獲得超4個贊

我想我會這樣做：創建一個僅包含單詞的新列表。在 for 循環中遍歷此列表，并在其中查找屬于外循環單詞的一部分的單詞。如果找到：用空字符串替換找到的部分。如果隨后整個單詞被空字符串替換：顯示原始列表中相應索引的單詞。

編輯：正如評論中所指出的，在某些情況下代碼可能存在問題，例如：lst = ["BREAKFASTBUFFET;35", "BREAK;60", "BREAKFAST;18", "BUFFET;75"] 在 BREAKFASTBUFFET 中，我首先發現 BREAK 是其中的一部分，所以我用空字符串替換了該代碼，這阻止找到早餐。我希望可以通過按單詞長度降序對列表進行排序來解決這個問題。

編輯2
我之前的編輯并不是完美無缺的，比如有一個詞BREAKFASTEN，它不應該被BREAKFAST“吃掉”。該版本執行以下操作：

列出候選者列表：調查中單詞的所有單詞
制作另一個以該單詞開頭的單詞列表
跟蹤候選列表中您已經嘗試過的單詞
一會兒正確：繼續嘗試，直到開始列表為空，或者您已成功地將所有單詞替換為候選單詞

lst = ['FAST;5','BREAK;60','FASTBREAK;40',

'OUTBREAK;110','BREAKFASTBUFFET;35',

'POINTS;25',

'BUFFET;75','FASTBREAKPOINTS;60', 'BREAKPOINTS;15'

]

lst2 = [ s.split(';')[0] for s in lst ]

for i, word in enumerate(lst2):

# candidates: words that are part of current word

candidates = [ x for i2, x in enumerate(lst2) if x in word and i != i2 ]

if len(candidates) > 0:

tried = []

word2 = word

found = False

while not found:

# start: subset of candidates that the current word starts with

start = [ x for x in candidates if word2.startswith(x) and x not in tried ]

for trial in start:

word2 = word2.replace(trial,'')

tried.append(trial)

if len(word2)==0:

print(lst[i])

found = True

break

if len(candidates)>1:

candidates = candidates[1:]

word2=candidates[0]

else:

break

反對回復 2024-01-16

萬千封印

TA貢獻1891條經驗獲得超3個贊

這是一些完成這項工作的代碼。我確信它并不適合您的情況（有一百萬個條目），但也許在某些方面有用：

#!/usr/bin/env python

from collections import namedtuple

Word = namedtuple("Word", ("characters", "number"))

separator = ";"

lst = [

"FAST;5",

"BREAK;60",

"FASTBREAK;40",

"OUTBREAK;110",

"BREAKFASTBUFFET;35",

"BUFFET;75",

"FASTBREAKPOINTS;60",

]

words = [Word(*w.rsplit(separator, 1)) for w in lst]

def findparts(oword, parts):

if len(oword.characters) == 0:

return parts

for iword in words:

if not parts and iword.characters == oword.characters:

continue

if iword.characters in oword.characters:

parts.append(iword)

characters = oword.characters.replace(iword.characters, "")

return findparts(Word(characters, oword.number), parts)

return []

ans = []

for word in words:

parts = findparts(word, [])

if parts:

ans.append(separator.join(word))

print(ans)

它使用遞歸函數，獲取列表中的單詞并嘗試將其與同一列表中的其他單詞組合起來。此功能還將向您展示形成復合詞的實際原子詞。

然而，它并不是很聰明。以下是它不會檢測的組合示例：[BREAKFASTBUFFET, BREAK, BREAKFAST, BUFFET]。

它使用了一個命名元組來暫時將實際單詞與附加到它的數字分開，假設分隔符始終是;。

我不認為正則表達式比簡單的字符串搜索有優勢。

如果您知道有關復合詞組成的更多條件，例如組件的最大數量，itertools組合生成器可能會幫助您顯著加快速度并避免錯過上面給出的示例。

反對回復 2024-01-16

3 回答
0 關注
273 瀏覽

關注

添加回答

舉報

0/150

提交

取消

亚洲在线久爱草,狠狠天天香蕉网,天天搞日日干久草,伊人亚洲日本欧美

熱搜

最近搜索清空

在單詞列表中查找復合詞 - Python

在單詞列表中查找復合詞 - Python

3 回答

添加回答