首頁猿問有效刪除元組列表中的部分重復項

有效刪除元組列表中的部分重復項

Python

慕后森 2023-09-19 14:49:42

我有一個元組列表，該列表的長度可以在 ~8 - 1000 之間變化，具體取決于元組的長度。列表中的每個元組都是唯一的。元組的長度為 N，其中每個條目都是通用單詞。示例元組的長度可以是 N(Word 1, Word 2, Word 3, ..., Word N)對于列表中的任何元組，所述元組中的元素 j 將是''或Word j一個非常簡單的字母示例是l = [('A', 'B', '', ''), ('A', 'B', 'C', ''), ('', '', '', 'D'), ('A', '', '', 'D'), ('', 'B', '', '')]每個元組的每個位置要么具有相同的值，要么為空。''我想刪除所有在同一位置的另一個元組中具有所有非值的元組。例如，其中(A,B,'','')包含所有非''值(A,B,C,'')，因此應將其刪除。filtered_l = [(A,B,C,''),(A,'','',D)]元組的長度始終相同（不一定是 4）。元組的長度在 2-10 之間。最快的方法是什么？

查看完整描述

5 回答

冉冉說

TA貢獻1877條經驗獲得超1個贊

讓我們將每個元組概念化為一個二進制數組，其中 1 是“包含某些內容”，2 是“包含一個空字符串”。由于每個位置的項目都是相同的，因此我們不需要關心每個位置有什么，只關心有什么。

l = [('A','B','',''),('A','B','C',''),('','','','D'),('A','','','D'),('','B','','')]

l_bin = [sum(2**i if k else 0 for i,k in enumerate(tup)) for tup in l]

# [3, 7, 8, 9, 2]

# [0b0011, 0b0111, 0b1000, 0b1001, 0b0010]

# that it's backwards doesn't really matter, since it's consistent

現在，我們可以遍歷該列表并構建一個沒有“重復項”的新數據結構。由于我們將元組編碼為二進制，因此我們可以通過執行按位運算來確定重復的、被另一個“包圍”的元組 - 給定aand b， if a | b == a，thena必須包含b。

codes = {}

for tup, b in zip(l, l_bin):

# check if any existing code contains the potential new one

# in this case, skip adding the new one

if any(a | b == a for a in codes):

continue

# check if the new code contains a potential existing one or more

# in which case, replace the existing code(s) with the new code

for a in list(codes):

if b | a == b:

codes.pop(a)

# and finally, add this code to our datastructure

codes[b] = tup

現在我們可以撤回我們的“過濾”元組列表：

output = list(codes.values())

# [('A', 'B', 'C', ''), ('A', '', '', 'D')]

請注意，(A, B, C, '')包含(A, B, '', '')和('', B, '', '')，并且(A, '', '', D')包含('', '', '', D)，所以這應該是正確的。

從 python 3.8 開始，dict保留插入順序，因此輸出應該與元組最初出現在列表中的順序相同。

這個解決方案不會非常有效，因為代碼的數量可能會堆積起來，但它應該在 O(n) 和 O(n^2) 之間，具體取決于最后剩下的唯一代碼的數量（并且因為每個元組的長度明顯小于的長度l，它應該更接近 O(n) 而不是 O(n^2)。

反對回復 2023-09-19

縹緲止盈

TA貢獻2041條經驗獲得超4個贊

特別是對于該限制，明顯的解決方案是將每個元組轉換為位掩碼，將它們累積在計數器數組中，執行子集和轉換，然后過濾數組l。

詳細代碼說明見評論。

時間復雜度顯然是n + m * 2^m，其中n是元組的數量，m是每個元組的長度。對于n == 1000和m == 10，這顯然比更快n^2。

l = [('A','B','',''),('A','B','C',''),('','','','D'),('A','','','D'),('','B','','')]

# assumes that l is not empty. (to access l[0])

# The case where l is empty is trivial to handle.

def tuple_to_mask(tuple_):

# convert the information whether each value in (tuple_) is empty to a bit mask

# (1 is empty, 0 is not empty)

return sum((value == '') << index for index, value in enumerate(tuple_))

count = [0] * (1 << len(l[0]))

for tuple_ in l:

# tuple_ is a tuple.

count[tuple_to_mask(tuple_)] += 1

# now count[mask] is the number of tuples in l with that mask

# transform the count array.

for dimension in range(len(l[0])):

for mask in range(len(count)):

if mask >> dimension & 1:

count[mask] += count[mask - (1 << dimension)]

# now count[mask] is the number of tuples in l with a mask (mask_) such that (mask) contains (mask_)

# (i.e. all the bits that are set in mask_ are also set in mask)

filtered_l = [tuple_ for tuple_ in l if count[tuple_to_mask(tuple_)] == 1]

print(filtered_l)

反對回復 2023-09-19

喵喵時光機

TA貢獻1846條經驗獲得超7個贊

我不確定這是否是最有效的或Python式的方法，但這將是直接的方法（同樣，也許其他人會提供更復雜的列表理解方法）：

看看這個：

l = [('A','B','',''),('A','B','C',''),('','','','D'),('A','','','D'),('','B','','')]

def item_in_list(item, l):

for item2comp in l:

if item!=item2comp:

found = True

for part,rhs_part in zip(item, item2comp):

if part!='' and part!=rhs_part:

found = False

break

if found:

return True

return False

new_arr = []

for item in l:

if not item_in_list(item, l):

new_arr.append(item)

print(new_arr)

輸出：

[('A', 'B', 'C', ''), ('A', '', '', 'D')]

我認為時間復雜度是 - O((N**2)*M)

N - 列表中的元素數量

M - 每個元素中的零件數

反對回復 2023-09-19

哈士奇WWW

TA貢獻1799條經驗獲得超6個贊

這些字符串始終位于同一位置，因此我將它們替換為布爾值，以便更輕松地比較它們。首先，我進行排序，然后僅保留與所有其他元素相比，前一個元素在任何地方都為 true 或與后一個元素相同的元素。然后，當比較完成后，我會將其從列表中刪除。

f = sorted(map(lambda x: list(map(bool, x)), l), key=sum, reverse=True)

to_keep = []

while len(f) > 1:

? ? if all(map(lambda x, y: True if x == y or x else False, f[0], f[1])):

? ? ? ? to_keep.append(len(l) - len(f) + 1)

? ? f = f[1:]

print([l[i] for i in to_keep])

[('A', 'B', 'C', ''), ('A', '', '', 'D')]

速度為 43.7 μs，也是投票最高答案的兩倍。

反對回復 2023-09-19

莫回無

TA貢獻1865條經驗獲得超7個贊

L = [('A', 'B','',''),('A','B','C',''),('','','','D'),('A','','','D'),('','B','','')]

keys = collections.defaultdict(lambda: collections.defaultdict(set))

# maintain a record of tuple-indices that contain each character in each position

for i,t in enumerate(L):

for c,e in enumerate(t):

if not e: continue

keys[e][c].add(i)

delme = set()

for i,t in enumerate(L):

collocs = set.intersection(*[keys[e][c] for c,e in enumerate(t) if e])

if len(collocs)>1: # if all characters appear in this position in >1 index

# ignore the collocation with the most non-empty characters

# mark the rest for deletion

C = max(collocs, key=lambda i: sum(bool(e) for bool in L[i]))

for c in collocs:

if c!=C: delme.add(c)

filtered = [t for i,t in enumerate(L) if i not in delme]

反對回復 2023-09-19

5 回答
0 關注
185 瀏覽

關注

添加回答

舉報

0/150

提交

取消

亚洲在线久爱草,狠狠天天香蕉网,天天搞日日干久草,伊人亚洲日本欧美

熱搜

最近搜索清空

有效刪除元組列表中的部分重復項

有效刪除元組列表中的部分重復項

5 回答

添加回答