

Comparing n-grams and grouping duplicates


偶然的你 2023-08-08 16:00:15
I am writing a script that treats two lines as duplicates if three consecutive words match between them. Suppose my current dataset is:

1 A Course of Pure Mathematics by G. H. Hardy
2 Agile Software Development, Principles, Patterns, and Practices by Robert C. Martin
3 Advanced Programming in the UNIX Environment, 3rd Edition
4 Advanced Selling Strategies: Brian Tracy
5 Advanced Programming in the UNIX(R) Environment
6 Alex's Adventures in Numberland: Dispatches from the Wonderful World of Mathematics by Alex Bellos, Andy Riley
7 Advertising Secrets of the Written Word: The Ultimate Resource on How to Write Powerful Advertising
8 Agile Software Development, Principles, Patterns, and Practices
9 A Course of Pure Mathematics (Cambridge Mathematical Library) 10th Edition by G. H. Hardy
10 Alex’s Adventures in Numberland
11 Advertising Secrets of the Written Word
12 Alex's Adventures in Numberland Paperback by Alex Bellos

Here, 1 and 9 are duplicates because "course pure mathematics" matches. 2 and 8 are duplicates because "agile software development" matches. 3 and 5 are duplicates because "advanced programming unix" matches. And so on ...
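To make the matching rule concrete, here is a minimal sketch of how lines 1 and 9 end up sharing a three-word sequence. The helper name `trigrams` and the short stop-word set are assumptions for illustration only; they are not part of the question.

```python
import re

# Illustrative stop-word set (an assumption; a real list would be longer).
STOP_WORDS = {"a", "an", "and", "by", "in", "of", "on", "the", "to"}


def trigrams(title):
    """Return the set of 3-word sequences in a title after normalization."""
    cleaned = re.sub(r"[^\w\s]", " ", title).lower()  # punctuation -> space
    words = [w for w in cleaned.split() if w not in STOP_WORDS]
    return {tuple(words[i:i + 3]) for i in range(len(words) - 2)}


line1 = "A Course of Pure Mathematics by G. H. Hardy"
line9 = ("A Course of Pure Mathematics (Cambridge Mathematical Library) "
         "10th Edition by G. H. Hardy")

shared = trigrams(line1) & trigrams(line9)
print(("course", "pure", "mathematics") in shared)  # True
```

With this normalization the two lines also share the sequence ("g", "h", "hardy"), so the author name alone would be enough to mark them as duplicates.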

1 Answer

寶慕林4294392


OP here. The solution appears to be:


import re

from nltk.util import ngrams

# Titles are lower-cased before filtering, so only lower-case stop words are needed.
STOP_WORDS = {'i', 'a', 'about', 'an', 'and', 'are', 'as', 'at', 'be', 'by', 'com', 'for',
              'from', 'how', 'in', 'is', 'it', 'of', 'on', 'or', 'that', 'the', 'this',
              'to', 'was'}

books_with_ngrams = []  # one (original line, set of 3-grams) pair per kept line

with open('UnifiedBookList.txt') as fin:
    for line in fin:
        original = line.strip()
        cleaned = re.sub(r'[^\w\s]', ' ', original).lower()  # replace punctuation with space
        # split() with no argument also collapses repeated whitespace
        tokens = [w for w in cleaned.split() if w not in STOP_WORDS]
        if len(tokens) > 2:  # at least 3 words are needed to form a 3-gram
            # Store the original line next to its n-grams so the two cannot get out of step
            books_with_ngrams.append((original, set(ngrams(tokens, 3))))

while books_with_ngrams:
    first_line, first_grams = books_with_ngrams.pop(0)
    matches = [entry for entry in books_with_ngrams if first_grams & entry[1]]
    if matches:
        print(first_line)
        for line, _ in matches:
            print(line)
    # Build a new list rather than calling remove() while iterating, which skips items
    books_with_ngrams = [entry for entry in books_with_ngrams if not (first_grams & entry[1])]
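For anyone who wants to try this without the input file or nltk, below is a rough, self-contained sketch that runs on the twelve titles from the question. Note that it is not the same approach as the code above: it swaps the pairwise comparison for an index from each 3-gram to the first title containing it, plus a small union-find, so titles are grouped even when they are linked only through a chain of shared 3-grams. The names `trigrams`, `group_duplicates`, and `TITLES` are made up for this example.

```python
import re
from collections import defaultdict

STOP_WORDS = {'i', 'a', 'about', 'an', 'and', 'are', 'as', 'at', 'be', 'by',
              'com', 'for', 'from', 'how', 'in', 'is', 'it', 'of', 'on', 'or',
              'that', 'the', 'this', 'to', 'was'}

TITLES = [
    "A Course of Pure Mathematics by G. H. Hardy",
    "Agile Software Development, Principles, Patterns, and Practices by Robert C. Martin",
    "Advanced Programming in the UNIX Environment, 3rd Edition",
    "Advanced Selling Strategies: Brian Tracy",
    "Advanced Programming in the UNIX(R) Environment",
    "Alex's Adventures in Numberland: Dispatches from the Wonderful World of "
    "Mathematics by Alex Bellos, Andy Riley",
    "Advertising Secrets of the Written Word: The Ultimate Resource on How to "
    "Write Powerful Advertising",
    "Agile Software Development, Principles, Patterns, and Practices",
    "A Course of Pure Mathematics (Cambridge Mathematical Library) 10th Edition "
    "by G. H. Hardy",
    "Alex’s Adventures in Numberland",
    "Advertising Secrets of the Written Word",
    "Alex's Adventures in Numberland Paperback by Alex Bellos",
]


def trigrams(title):
    """Return the set of 3-word sequences in a title after normalization."""
    cleaned = re.sub(r"[^\w\s]", " ", title).lower()  # punctuation -> space
    words = [w for w in cleaned.split() if w not in STOP_WORDS]
    return {tuple(words[i:i + 3]) for i in range(len(words) - 2)}


def group_duplicates(titles):
    """Return groups of 1-based title numbers that share a 3-gram,
    either directly or through a chain of other titles."""
    parent = list(range(len(titles)))

    def find(i):
        # Follow parent links to the group's root, shortening the path as we go
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_seen = {}  # 3-gram -> index of the first title that contains it
    for idx, title in enumerate(titles):
        for gram in trigrams(title):
            if gram in first_seen:
                parent[find(idx)] = find(first_seen[gram])  # merge the two groups
            else:
                first_seen[gram] = idx

    groups = defaultdict(list)
    for idx in range(len(titles)):
        groups[find(idx)].append(idx + 1)
    return [members for members in groups.values() if len(members) > 1]


print(group_duplicates(TITLES))
# [[1, 9], [2, 8], [3, 5], [6, 10, 12], [7, 11]]
```

Each title is tokenized once and each 3-gram is looked up in a dictionary, so the work grows with the total number of 3-grams rather than with the number of title pairs. Whether chained grouping is desirable depends on the data: with long titles it can merge groups that a strict pairwise rule would keep apart.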


Answered 2023-08-08