首頁猿問查找字符串中最長的重復序列

查找字符串中最長的重復序列

Python 正則表達式

開心每一天1111 2019-12-26 14:19:14

我需要找到一個字符串中最長的序列，但要注意的是，該序列必須重復三次或更多次。因此，例如，如果我的字符串是：fdwaw4helloworldvcdv1c3xcv3xcz1sda21f2sd1ahelloworldgafgfa4564534321fadghelloworld那么我想返回值“ helloworld ”。我知道完成此操作的幾種方法，但是我面臨的問題是實際的字符串非常大，因此我確實在尋找一種可以及時實現的方法。

查看完整描述

3 回答

慕田峪9158850

TA貢獻1794條經驗獲得超8個贊

這個問題是最長重復子串問題的一個變體，并且存在一個使用后綴樹的O（n）時間算法來解決。這個想法（如Wikipedia所建議）是構造后綴樹（時間O（n）），用后代數注釋樹中的所有節點（使用DFS的時間O（n）），然后找到樹中具有至少三個后代的最深節點（使用DFS的時間O（n））。該總體算法花費時間O（n）。

也就是說，眾所周知，后綴樹很難構建，因此您可能想要在嘗試此實現之前，找到一個為您實現后綴樹的Python庫?？焖俚腉oogle搜索打開了這個庫，盡管我不確定這是否是一個很好的實現。

希望這可以幫助！

反對回復 2019-12-26

犯罪嫌疑人X

TA貢獻2080條經驗獲得超4個贊

使用defaultdict對從輸入字符串中每個位置開始的每個子字符串進行計數。OP尚不清楚是否應該包含重疊的匹配項，這種蠻力方法包括它們。

from collections import defaultdict

def getsubs(loc, s):

substr = s[loc:]

i = -1

while(substr):

yield substr

substr = s[loc:i]

i -= 1

def longestRepetitiveSubstring(r, minocc=3):

occ = defaultdict(int)

# tally all occurrences of all substrings

for i in range(len(r)):

for sub in getsubs(i,r):

occ[sub] += 1

# filter out all substrings with fewer than minocc occurrences

occ_minocc = [k for k,v in occ.items() if v >= minocc]

if occ_minocc:

maxkey = max(occ_minocc, key=len)

return maxkey, occ[maxkey]

else:

raise ValueError("no repetitions of any substring of '%s' with %d or more occurrences" % (r,minocc))

印刷品：

('helloworld', 3)

反對回復 2019-12-26

ibeautiful

TA貢獻1993條經驗獲得超6個贊

讓我們從頭開始，計算頻率，并在出現最頻繁的元素3次或更多次后立即停止。

from collections import Counter

a='fdwaw4helloworldvcdv1c3xcv3xcz1sda21f2sd1ahelloworldgafgfa4564534321fadghelloworld'

times=3

for n in range(1,len(a)/times+1)[::-1]:

substrings=[a[i:i+n] for i in range(len(a)-n+1)]

freqs=Counter(substrings)

if freqs.most_common(1)[0][1]>=3:

seq=freqs.most_common(1)[0][0]

break

print "sequence '%s' of length %s occurs %s or more times"%(seq,n,times)

結果：

>>> sequence 'helloworld' of length 10 occurs 3 or more times

編輯：如果您感覺自己正在處理隨機輸入，并且公共子字符串的長度應該很小，那么最好以小子字符串開始（如果需要速度），而當找不到任何出現在該子字符串時停止至少3次：

from collections import Counter

a='fdwaw4helloworldvcdv1c3xcv3xcz1sda21f2sd1ahelloworldgafgfa4564534321fadghelloworld'

times=3

for n in range(1,len(a)/times+1):

substrings=[a[i:i+n] for i in range(len(a)-n+1)]

freqs=Counter(substrings)

if freqs.most_common(1)[0][1]<3:

n-=1

break

else:

seq=freqs.most_common(1)[0][0]

print "sequence '%s' of length %s occurs %s or more times"%(seq,n,times)

與上述相同的結果。

反對回復 2019-12-26

3 回答
0 關注
1060 瀏覽

關注

添加回答

舉報

0/150

提交

取消

亚洲在线久爱草,狠狠天天香蕉网,天天搞日日干久草,伊人亚洲日本欧美

熱搜

最近搜索清空

查找字符串中最長的重復序列

查找字符串中最長的重復序列

3 回答

添加回答