2 Answers

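I turned your code snippet into a function that takes the path of the folder containing the input files as its argument. The code below collects every file in that folder and, for each one, writes a cleaned_output.txt and a test.txt into a newly created output directory. Each output file name has the name of the input file it was generated from appended to it, which makes the outputs easier to tell apart, but you can change that to suit your needs.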
from collections import Counter
import os

from nltk.corpus import stopwords  # needs a one-time nltk.download('stopwords')

path = 'input/'


def clean_text(path):
    out_path = 'output'
    # Create the output directory; do nothing if it already exists.
    os.makedirs(out_path, exist_ok=True)

    # Load the stop words once rather than once per file.
    stop_words = set(stopwords.words('english'))

    # Keep regular files only, skipping any subdirectories.
    files = [f for f in os.listdir(path) if os.path.isfile(os.path.join(path, f))]

    for name in files:
        # Drop the extension to get the name used in the output files.
        base = os.path.splitext(name)[0]

        # Read the whole file and lowercase every whitespace-separated word.
        with open(os.path.join(path, name)) as input_file:
            words = [word.lower() for word in input_file.read().split()]
        print(words)

        # Discard the stop words.
        kept = [word for word in words if word not in stop_words]

        # Mode 'w' overwrites, so re-running the script does not append duplicate text.
        cleaned = os.path.join(out_path, 'cleaned_output_{}.txt'.format(base))
        with open(cleaned, 'w') as output_file:
            output_file.write(' '.join(kept))

        # Record the ten most frequent remaining words.
        count = Counter(kept)
        report = os.path.join(out_path, 'test_{}.txt'.format(base))
        with open(report, 'w') as report_file:
            print(count.most_common(10), file=report_file)


clean_text(path)
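A side note on building the output names from the input file names: `str.strip('.txt')` does not remove a `.txt` suffix. It trims any of the characters `.`, `t` and `x` from both ends of the string, so some names get mangled, whereas `os.path.splitext` separates only the final extension. Here is a minimal sketch of the difference; `base_name` is a hypothetical helper I made up for illustration, not something from your snippet.

```python
import os


def base_name(filename):
    """Return the file name without its final extension (hypothetical helper)."""
    return os.path.splitext(filename)[0]


# str.strip treats its argument as a set of characters to trim from both ends.
stripped = "text.txt".strip(".txt")

# os.path.splitext separates only the final extension.
split = base_name("text.txt")

print(repr(stripped), repr(split))  # prints: 'e' 'text'
```

For a name like `notes.txt` both approaches happen to agree, which is why the problem is easy to miss until a file name starts or ends with one of those characters.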
Is this what you were looking for?