
Splitting text into whole sentences without exceeding a character limit in Python

Python

湖上湖 2023-10-11 21:14:51

I have a string containing sentences. If the string has more characters than a given limit, I want to split it into several strings, each shorter than the maximum character count but still made up of complete sentences. I wrote the code below and it seems to work, but I'm not sure whether I'll run into bugs when it goes to production. Does it look OK?

from nltk.tokenize import sent_tokenize

sentences = sent_tokenize(my_text)
sentences_split = []
shortened_sentence = ""
for idx, sentence in enumerate(sentences):
    if len(shortened_sentence) + len(sentence) < 5120:
        shortened_sentence += sentence
    if (len(shortened_sentence) + len(sentence) > 5120) or (idx + 1 == len(sentences)):
        sentences_split.append(shortened_sentence)
        shortened_sentence = ""
print(sentences_split)


1 Answer

嗶嗶one


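To better explain my point (made in the comments) about the problem with the second if block, see the following example. We want strings with a maximum length of 15, so the 5120 in your code becomes 16 in this example. As you can see, the first 3 items in the list add up to 5 + 6 + 4 = 15, so the first shortened_sentence should consist of the first 3 items in the list. But it doesn't, because the logic of the second if is incorrect: it runs after the current sentence has already been appended to shortened_sentence, so that sentence's length is counted twice.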

sentences = ['abcde', 'fghijk', 'lmno', 'pqr']
# we need sentences with fewer than 16 chars
print([len(sentence) for sentence in sentences])

sentences_split = []
shortened_sentence = ""
for idx, sentence in enumerate(sentences):
    if len(shortened_sentence) + len(sentence) < 16:
        shortened_sentence += sentence
    if (len(shortened_sentence) + len(sentence) > 16) or (idx + 1 == len(sentences)):
        sentences_split.append(shortened_sentence)
        shortened_sentence = ""

print(sentences_split)
print([len(sentence) for sentence in sentences_split])

Output:

[5, 6, 4, 3]

['abcdefghijk', 'lmnopqr']

[11, 7]

Compare that with:

sentences = ['abcde', 'fghijk', 'lmno', 'pqr']
# we need sentences with fewer than 16 chars
print([len(word) for word in sentences])

sentences_split = []
shortened_sentence = ""
for sentence in sentences:
    if len(shortened_sentence) + len(sentence) < 16:
        shortened_sentence += sentence
    else:
        # the current chunk is full: store it and start a new one
        sentences_split.append(shortened_sentence)
        shortened_sentence = sentence
# store the last chunk after the loop ends
sentences_split.append(shortened_sentence)

print(sentences_split)
print([len(sentence) for sentence in sentences_split])

Output:

[5, 6, 4, 3]

['abcdefghijklmno', 'pqr']

[15, 3]

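Finally, if you're not sure whether you'll "run into bugs when it goes to production", write tests, lots of them. That is what tests are for: helping minimize bugs in production.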
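As a rough illustration of what such tests might look like, here is a minimal sketch. It assumes the second snippet is wrapped in a function; the names split_sentences and check_invariants are hypothetical and not part of the original code. It pins down the known example, checks two invariants on randomized input, and prints one edge case that is worth deciding on explicitly.

```python
import random


def split_sentences(sentences, limit):
    # The second snippet wrapped in a function (hypothetical name).
    # Chunks are kept strictly shorter than `limit` characters.
    sentences_split = []
    shortened_sentence = ""
    for sentence in sentences:
        if len(shortened_sentence) + len(sentence) < limit:
            shortened_sentence += sentence
        else:
            sentences_split.append(shortened_sentence)
            shortened_sentence = sentence
    sentences_split.append(shortened_sentence)
    return sentences_split


def check_invariants(sentences, limit):
    chunks = split_sentences(sentences, limit)
    # No text is lost or reordered.
    assert "".join(chunks) == "".join(sentences)
    # Every chunk respects the limit (the inputs used here never exceed it alone).
    assert all(len(chunk) < limit for chunk in chunks)


# Known case from the example in this answer.
assert split_sentences(['abcde', 'fghijk', 'lmno', 'pqr'], 16) == ['abcdefghijklmno', 'pqr']

# Randomized cases with a fixed seed so that failures are reproducible.
rng = random.Random(0)
for _ in range(1000):
    limit = rng.randint(5, 50)
    sentences = ["x" * rng.randint(1, limit - 1) for _ in range(rng.randint(0, 20))]
    check_invariants(sentences, limit)

# Edge case: a first sentence that is longer than the limit.
print(split_sentences(['a' * 20, 'b'], 16))
```

The last call shows behavior that the small example hides: an oversized first sentence produces an empty string as the first chunk and is then emitted as a chunk that exceeds the limit. Whether that should raise an error, be split further, or be passed through is a decision the tests should make explicit.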

Also note that the second snippet is just one example implementation; other implementations are possible.
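One possible alternative, sketched here under a few assumptions, is a generator that joins sentences with a separator instead of concatenating them directly. As far as I know, sent_tokenize returns sentences without the whitespace that separated them, so plain concatenation would glue them together. The name iter_chunks, the sep parameter, and the choice to treat max_chars as an inclusive limit are all assumptions made for illustration.

```python
def iter_chunks(sentences, max_chars, sep=" "):
    """Yield chunks of whole sentences joined by sep, each at most max_chars long.

    A single sentence longer than max_chars is yielded on its own
    (an assumption; it could also be rejected or split further).
    """
    parts = []   # sentences collected for the current chunk
    length = 0   # length of sep.join(parts)
    for sentence in sentences:
        # Cost of adding this sentence, including the separator if needed.
        extra = len(sentence) + (len(sep) if parts else 0)
        if parts and length + extra > max_chars:
            yield sep.join(parts)
            parts, length = [sentence], len(sentence)
        else:
            parts.append(sentence)
            length += extra
    if parts:
        yield sep.join(parts)


text_sentences = ["One two.", "Three four five.", "Six."]
print(list(iter_chunks(text_sentences, 25)))
# ['One two. Three four five.', 'Six.']
```

Because the current chunk is only yielded when it is non-empty, this variant never produces empty strings, and an empty input yields nothing at all.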

Answered 2023-10-11
