
How do you optimize searching a large file in Python?

Python

一只甜甜圈 2023-06-20 14:32:55

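I have a large file containing about 8 million lines of filenames, and I am trying to find the filenames that contain a specific value. Searching for one value is fine, but I need to search for about 50k unique values, and the search takes a very long time.

with open('UniqueValueList.txt') as g:
    uniqueValues = g.read().splitlines()
outF = open("Filenames_With_Unique_Values.txt", "w")
with open('Filenames_File.txt') as f:
    fileLine = f.readlines()
    for line in fileLine:
        for value in uniqueValues:
            if value in line:
                outF.write(line)
outF.close()

I cannot load the filenames file into memory because it is too large. Is there another way to optimize this search?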


2 Answers

慕無忌1623718

Contributed 1744 experience points, received more than 4 upvotes

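My two ideas are (1) memory-map the file and run a multiline regular expression over it for each value, and (2) split the work across multiple subprocesses. I combined the two and came up with the following. It may be possible to create the mmap in the parent process and share it, but I took the simple route and create it in each subprocess, on the assumption that the operating system will work out efficient sharing of the pages.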

import multiprocessing as mp
import mmap
import os
import re

def _value_find_worker_init(filename):
    """Called when initializing mp.Pool to open an mmapped file in subprocesses.

    The file is `global mmap_file` so that the worker can find it.
    """
    global mmap_file
    filenames_fd = os.open(filename, os.O_RDONLY)
    mmap_file = mmap.mmap(filenames_fd, length=os.stat(filename).st_size,
                          access=mmap.ACCESS_READ)

def _value_find_worker(value):
    """Return a list of matching lines in `global mmap_file`."""
    # multiline regex for findall; re.escape makes the value match literally
    regex = b"(?m)^.*?" + re.escape(value) + b".*?$"
    return re.compile(regex).findall(mmap_file)

def find_unique():
    with open("UniqueValueList.txt", "rb") as g:
        # skip blank lines, which would otherwise match every filename
        uniqueValues = [line.strip() for line in g if line.strip()]
    with mp.Pool(initializer=_value_find_worker_init,
                 initargs=("Filenames_File.txt",)) as pool:
        # a set, so a line matched by several values is written only once
        matched_values = set()
        for matches in pool.imap_unordered(_value_find_worker, uniqueValues):
            matched_values.update(matches)
    with open("Filenames_With_Unique_Values.txt", "wb") as outfile:
        outfile.writelines(value + b"\n" for value in matched_values)

# the guard is needed on platforms where subprocesses re-import this module
if __name__ == "__main__":
    find_unique()
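To see idea (1) on its own, here is a minimal single-process sketch that memory-maps a small temporary file and pulls out whole lines with a multiline regex. The helper names `find_lines_containing` and `demo` and the sample filenames are made up for illustration and are not part of the code above.

```python
import mmap
import os
import re
import tempfile

def find_lines_containing(path, value):
    """Return every line of `path` (as bytes) that contains `value`."""
    with open(path, "rb") as f:
        # length 0 maps the whole file; mmap raises on an empty file
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # (?m) lets ^ and $ anchor at line boundaries inside the mapping;
            # re.escape treats the value as literal text (an assumption here)
            pattern = re.compile(b"(?m)^.*" + re.escape(value) + b".*$")
            return pattern.findall(mm)

def demo():
    # hypothetical sample data written to a throwaway file
    with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
        tmp.write(b"report_2021.csv\nphoto_001.jpg\nreport_2022.csv\n")
        path = tmp.name
    try:
        return find_lines_containing(path, b"report")
    finally:
        os.remove(path)

if __name__ == "__main__":
    print(demo())
```

Because the regex engine scans the mapped bytes directly, the file is never read into a Python list of lines; the operating system pages it in as needed.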

2023-06-20

慕哥6287543

Contributed 1831 experience points, received more than 10 upvotes

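We can use the file object itself as an iterator. The iterator returns the file one line at a time, and each line can be processed as it arrives. This never reads the whole file into memory, so it is suitable for reading large files in Python.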
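A rough sketch of that idea applied to the question, assuming both files are UTF-8 text. The function names `filter_lines` and `demo` and the sample data are hypothetical.

```python
import os
import tempfile

def filter_lines(filenames_path, values_path, out_path):
    """Copy each line of filenames_path that contains any value to out_path.

    Returns the number of lines written.
    """
    with open(values_path, encoding="utf-8") as g:
        # drop empty entries, which would match every line
        values = [v for v in g.read().splitlines() if v]
    written = 0
    with open(filenames_path, encoding="utf-8") as f, \
         open(out_path, "w", encoding="utf-8") as out:
        for line in f:  # the file object yields one line at a time
            # any() stops at the first hit, so a line is written at most once
            if any(value in line for value in values):
                out.write(line)
                written += 1
    return written

def demo():
    # hypothetical sample data in a throwaway directory
    with tempfile.TemporaryDirectory() as tmp:
        names = os.path.join(tmp, "Filenames_File.txt")
        values = os.path.join(tmp, "UniqueValueList.txt")
        out = os.path.join(tmp, "Filenames_With_Unique_Values.txt")
        with open(names, "w", encoding="utf-8") as f:
            f.write("a_123.txt\nb_456.txt\nc_123_456.txt\nd_789.txt\n")
        with open(values, "w", encoding="utf-8") as f:
            f.write("123\n456\n")
        count = filter_lines(names, values, out)
        with open(out, encoding="utf-8") as f:
            return count, f.read().splitlines()

if __name__ == "__main__":
    print(demo())
```

Note that this only solves the memory problem: each line is still compared against the values one by one, so the running time for 8 million lines and 50k values stays roughly the same as in the question.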

2023-06-20
