Extracting data from a Freebase data dump in Python

Python

Cats萌萌 2021-04-08 15:15:36

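After downloading the Freebase Triples data dump (freebase-rdf-latest.gz) from the website, what is the best way to open and read this file in order to extract information, for example relevant information about companies and businesses? (In Python.)

As far as I know, there are packages that can do this: open the gz file in Python and read the RDF file. I'm not sure how to go about it...

My failed attempt, Python 3.6:

    import gzip
    with gzip.open('freebase-rdf-latest.gz','r') as uncompressed_file:
        for line in uncompressed_file.read():
            print(line)

After that, using the XML structure, I could get the information by parsing, but I can't read the file.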

1 Answer

慕斯709654

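The problem is that calling read() with no size argument makes the gzip module decompress the entire file at once and hold the uncompressed data in memory. For a file this large, a more practical approach is to decompress the file a little at a time and stream the results.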
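As a point of comparison, the file object returned by gzip.open decompresses lazily, so iterating over it line by line in text mode also streams; it is the argument-less read() call that pulls everything into memory. A minimal sketch, where `iter_gzip_lines` is a hypothetical helper name and UTF-8 is an assumed encoding:

```python
import gzip


def iter_gzip_lines(path, encoding="utf-8"):
    """Yield decoded text lines from a gzip file, decompressing lazily."""
    # "rt" opens the stream in text mode, so each item is a str, not bytes.
    with gzip.open(path, "rt", encoding=encoding) as handle:
        # Iterating the handle reads and decompresses one buffer at a time.
        for line in handle:
            yield line
```

Something like `for line in iter_gzip_lines('freebase-rdf-latest.gz'): ...` would then process one line at a time. The hand-written zlib version that follows does the same job with explicit control over the chunk size.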

#!/usr/bin/env python3
import zlib


def stream_unzipped_bytes(filename, chunk_size=16384):
    """
    Generator function, reads gzip file `filename` and yields
    uncompressed bytes.

    This function answers your original question, how to read the file,
    but its output is a generator of bytes so there's another function
    below to stream these bytes as text, one line at a time.
    """
    wbits = zlib.MAX_WBITS | 16  # 16 requires gzip header/trailer
    with open(filename, 'rb') as f:
        decompressor = zlib.decompressobj(wbits)
        fbytes = f.read(chunk_size)
        while fbytes:
            yield decompressor.decompress(fbytes)
            fbytes = f.read(chunk_size)
        # emit anything the decompressor is still holding
        yield decompressor.flush()


def stream_text_lines(gen, encoding='utf-8'):
    """
    Generator wrapper function, `gen` is a bytes generator.
    Yields one line of text at a time.
    """
    buf = b''
    for chunk in gen:
        buf += chunk
        lines = buf.splitlines(keepends=True)
        if not lines:
            continue  # nothing buffered yet, wait for more data
        # yield all but the last line, because this may still be incomplete
        # and waiting for more data from gen
        for line in lines[:-1]:
            yield line.decode(encoding)
        # carry the possibly incomplete last line over to the next chunk
        buf = lines[-1]
    # yield the final data
    if buf:
        yield buf.decode(encoding)


# Sample usage, using the stream_text_lines generator to stream
# one line of RDF text at a time
if __name__ == '__main__':
    bytes_generator = stream_unzipped_bytes('freebase-rdf-latest.gz')
    for line in stream_text_lines(bytes_generator):
        # do something with `line` of text
        print(line, end='')
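To get from streamed lines to the company and business information the question asks about, each line still has to be split into its parts and filtered. The sketch below assumes the dump is not XML but one triple per line, with subject, predicate and object separated by tabs and a trailing "."; that layout, the helper names `parse_triple` and `iter_matching_triples`, and any type names passed in are assumptions to check against the actual file.

```python
def parse_triple(line):
    """Split one dump line into (subject, predicate, object), or return None."""
    parts = line.rstrip("\n").split("\t")
    # Assumed layout: subject, predicate, object, then a terminating "."
    if len(parts) != 4 or parts[3] != ".":
        return None
    return parts[0], parts[1], parts[2]


def iter_matching_triples(lines, predicate_part, object_part):
    """Yield triples whose predicate and object contain the given substrings."""
    for line in lines:
        triple = parse_triple(line)
        if triple is None:
            continue  # skip lines that do not fit the assumed layout
        _, predicate, obj = triple
        if predicate_part in predicate and object_part in obj:
            yield triple
```

Any iterable of text lines works as `lines`, including the output of stream_text_lines. For example, `iter_matching_triples(lines, 'type.object.type', 'business.')` would keep only triples that assign a type from a business-related domain, provided the dump really uses names of that shape.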

Answered 2021-04-27
