
How do I get the non-letter and non-digit characters appended to the list?

慕容3067478 2023-05-09 15:01:28
This is about a simple word count: collecting the words that occur in a document, along with how often they occur. I am trying to write a function whose input is a list of text lines. I iterate over all lines, split them into words, accumulate the recognized words, and finally return the complete list.

First, I have a while loop that walks over all characters in a line but skips whitespace. Inside this loop I also try to identify what kind of word I have. There are three kinds: words that start with a letter; words that start with a digit; and words consisting of a single character that is neither a letter nor a digit. I have three if statements that check what kind of character I am looking at. Once I know what kind of word I have encountered, I try to extract the word itself. When a word starts with a letter or a digit, I take all consecutive characters of the same kind as part of the word. However, I run into a problem in the third if statement, which handles the case where the current character is neither a letter nor a digit.

When I call

wordfreq.tokenize(['15,    delicious&   Tarts.'])

I expect the output to be

['15', ',', 'delicious', '&', 'tarts', '.']

But when I test the function in the Python console, it looks like this:

PyDev console: starting.
Python 3.7.4 (v3.7.4:e09359112e, Jul  8 2019, 14:54:52) [Clang 6.0 (clang-600.0.57)] on darwin
import wordfreq
wordfreq.tokenize(['15,    delicious&   Tarts.'])
['15', 'delicious', 'tarts']

The function picks up neither the comma, the ampersand, nor the period! How can I fix this? See the code below. (The lower() call is there because I want to ignore case, so that e.g. 'Tarts' and 'tarts' count as the same word.)

# wordfreq.py
def tokenize(lines):
    words = []
    for line in lines:
        start = 0
        while start < len(line):
            while line[start].isspace():
                start = start + 1
            if line[start].isalpha():
                end = start
                while line[end].isalpha():
                    end = end + 1
                word = line[start:end]
                words.append(word.lower())
                start = end
            elif line[start].isdigit():
                end = start
                while line[end].isdigit():
                    end = end + 1
                word = line[start:end]
                words.append(word)
                start = end
            else:
                words.append(line[start])
            start = start + 1
    return words

3 Answers

qq_遁去的一_1


I found what the problem was. The line

start = start + 1

should only appear inside the final else branch.

So my code now looks like this, and gives me the desired output specified above:


def tokenize(lines):
    words = []
    for line in lines:
        start = 0
        while start < len(line):
            while line[start].isspace():
                start = start + 1
            end = start
            if line[start].isalpha():
                while line[end].isalpha():
                    end = end + 1
                word = line[start:end]
                word = word.lower()
                words.append(word)
                start = end
            elif line[start].isdigit():
                while line[end].isdigit():
                    end = end + 1
                word = line[start:end]
                words.append(word)
                start = end
            else:
                word = line[start]
                words.append(word)
                start = start + 1
    return words

However, when I use the test script below to make sure I haven't missed any corner cases of the function 'tokenize'...


import io
import sys
import importlib.util

def test(fun,x,y):
    global pass_tests, fail_tests
    if type(x) == tuple:
        z = fun(*x)
    else:
        z = fun(x)
    if y == z:
        pass_tests = pass_tests + 1
    else:
        if type(x) == tuple:
            s = repr(x)
        else:
            s = "("+repr(x)+")"
        print("Condition failed:")
        print("   "+fun.__name__+s+" == "+repr(y))
        print(fun.__name__+" returned/printed:")
        print(str(z))
        fail_tests = fail_tests + 1

def run(src_path=None):
    global pass_tests, fail_tests

    if src_path == None:
        import wordfreq
    else:
        spec = importlib.util.spec_from_file_location("wordfreq", src_path+"/wordfreq.py")
        wordfreq = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(wordfreq)

    pass_tests = 0
    fail_tests = 0
    fun_count  = 0

    def printTopMost(freq,n):
        saved = sys.stdout
        sys.stdout = io.StringIO()
        wordfreq.printTopMost(freq,n)
        out = sys.stdout.getvalue()
        sys.stdout = saved
        return out

    if hasattr(wordfreq, "tokenize"):
        fun_count = fun_count + 1
        test(wordfreq.tokenize, [], [])
        test(wordfreq.tokenize, [""], [])
        test(wordfreq.tokenize, ["   "], [])
        test(wordfreq.tokenize, ["This is a simple sentence"], ["this","is","a","simple","sentence"])
        test(wordfreq.tokenize, ["I told you!"], ["i","told","you","!"])
        test(wordfreq.tokenize, ["The 10 little chicks"], ["the","10","little","chicks"])
        test(wordfreq.tokenize, ["15th anniversary"], ["15","th","anniversary"])
        test(wordfreq.tokenize, ["He is in the room, she said."], ["he","is","in","the","room",",","she","said","."])
    else:
        print("tokenize is not implemented yet!")

    if hasattr(wordfreq, "countWords"):
        fun_count = fun_count + 1
        test(wordfreq.countWords, ([],[]), {})
        test(wordfreq.countWords, (["clean","water"],[]), {"clean":1,"water":1})
        test(wordfreq.countWords, (["clean","water","is","drinkable","water"],[]), {"clean":1,"water":2,"is":1,"drinkable":1})
        test(wordfreq.countWords, (["clean","water","is","drinkable","water"],["is"]), {"clean":1,"water":2,"drinkable":1})
    else:
        print("countWords is not implemented yet!")

    if hasattr(wordfreq, "printTopMost"):
        fun_count = fun_count + 1
        test(printTopMost,({},10),"")
        test(printTopMost,({"horror": 5, "happiness": 15},0),"")
        test(printTopMost,({"C": 3, "python": 5, "haskell": 2, "java": 1},3),"python                  5\nC                       3\nhaskell                 2\n")
    else:
        print("printTopMost is not implemented yet!")

    print(str(pass_tests)+" out of "+str(pass_tests+fail_tests)+" passed.")

    return (fun_count == 3 and fail_tests == 0)

if __name__ == "__main__":
    run()

...I get the following output:


/usr/local/bin/python3.7 "/Users/ericjohannesson/Documents/Fristående kurser/DAT455 – Introduktion till programmering med Python/lab1/Laborations/Laboration_1/test.py"
Traceback (most recent call last):
  File "/Users/ericjohannesson/Documents/Fristående kurser/DAT455 – Introduktion till programmering med Python/lab1/Laborations/Laboration_1/test.py", line 81, in <module>
    run()
  File "/Users/ericjohannesson/Documents/Fristående kurser/DAT455 – Introduktion till programmering med Python/lab1/Laborations/Laboration_1/test.py", line 50, in run
    test(wordfreq.tokenize, ["   "], [])
  File "/Users/ericjohannesson/Documents/Fristående kurser/DAT455 – Introduktion till programmering med Python/lab1/Laborations/Laboration_1/test.py", line 10, in test
    z = fun(x)
  File "/Users/ericjohannesson/Documents/Fristående kurser/DAT455 – Introduktion till programmering med Python/lab1/Laborations/Laboration_1/wordfreq.py", line 44, in tokenize
    while line[start].isspace():
IndexError: string index out of range

Why does it say the string index is out of range, and how do I solve this?
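One likely cause, judging from the traceback: when a line consists of (or ends with) whitespace, the inner `while line[start].isspace():` loop advances `start` past the last valid index before the outer loop's bounds check runs. A bounds-checked sketch of the same tokenizer (every scan guarded with a length check; this is an illustration, not part of the original thread):

```python
def tokenize(lines):
    words = []
    for line in lines:
        start = 0
        while start < len(line):
            # guard every scan with a length check so we never index past the end
            while start < len(line) and line[start].isspace():
                start = start + 1
            if start >= len(line):
                break
            end = start
            if line[start].isalpha():
                while end < len(line) and line[end].isalpha():
                    end = end + 1
                words.append(line[start:end].lower())
                start = end
            elif line[start].isdigit():
                while end < len(line) and line[end].isdigit():
                    end = end + 1
                words.append(line[start:end])
                start = end
            else:
                words.append(line[start])
                start = start + 1
    return words

print(tokenize(['15,    delicious&   Tarts.']))  # ['15', ',', 'delicious', '&', 'tarts', '.']
print(tokenize(['   ']))                         # [] -- no more IndexError
```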


回首憶惘然


I'm not sure why you're doing all that scanning back and forth, but here is how you can split it:


input = ['15,    delicious&   Tarts.']
line = input[0]
words = line.split(' ')
words = [word for word in words if word]

out:

['15,', 'delicious&', 'Tarts.']

Edit: I see you have edited in the output format you want. Just skip this line to get that output:

    words = [word for word in words if word]
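For the exact output the question asks for (lowercased words, digit runs, and each remaining non-space character as its own token), `str.split` alone won't get there; a regular expression is one alternative — a sketch, not part of the original answer:

```python
import re

def tokenize_re(lines):
    # letter runs | digit runs | any other single non-space character
    return [tok.lower()
            for line in lines
            for tok in re.findall(r"[A-Za-z]+|\d+|\S", line)]

print(tokenize_re(['15,    delicious&   Tarts.']))  # ['15', ',', 'delicious', '&', 'tarts', '.']
```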


素胚勾勒不出你


itertools.groupby can simplify this considerably. Essentially, you group the characters in the string by their category or type (letter, digit, or punctuation). In this example I define only those three categories, but you can define as many as you need. Any character that matches no category (whitespace, in this case) is ignored:


def get_tokens(string):
    from itertools import groupby
    from string import ascii_lowercase, ascii_uppercase, digits, punctuation as punct

    alpha = ascii_lowercase + ascii_uppercase

    def category(char):
        # the first category that contains the character, or "" for anything else
        return next((c for c in (alpha, digits, punct) if char in c), "")

    yield from ("".join(group) for key, group in groupby(string, key=category) if key)

print(list(get_tokens("15,    delicious&   Tarts.")))

Output:

['15', ',', 'delicious', '&', 'Tarts', '.']

