Extracting data from the Freebase dump in Python

Cats萌萌 2021-04-08 15:15:36
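I downloaded the Freebase Triples data dump (freebase-rdf-latest.gz) from the website. What is the best way to open and read this file in Python in order to extract information, for example information about companies and businesses?

As far as I know, some packages can help with this: opening gz files in Python and reading RDF files. I am not sure how to put it together.

My failed attempt (Python 3.6):

import gzip

with gzip.open('freebase-rdf-latest.gz','r') as uncompressed_file:
    for line in uncompressed_file.read():
        print(line)

After that I could get the information by parsing the XML structure, but I cannot even read the file.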

1 Answer

慕斯709654


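The problem is that calling `.read()` with no argument decompresses the entire file at once and holds the uncompressed result in memory (iterating over those bytes then yields single integers, not lines). For a file this large, a more practical approach is to decompress it a little at a time and stream the results.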


#!/usr/bin/env python3

import zlib


def stream_unzipped_bytes(filename, chunk_size=16384):
    """
    Generator function, reads gzip file `filename` and yields
    uncompressed bytes.

    This function answers your original question, how to read the file,
    but its output is a generator of bytes so there's another function
    below to stream these bytes as text, one line at a time.
    """
    with open(filename, 'rb') as f:
        wbits = zlib.MAX_WBITS | 16  # +16: expect a gzip header/trailer
        # note: a single decompressobj handles one gzip member only
        decompressor = zlib.decompressobj(wbits)
        fbytes = f.read(chunk_size)
        while fbytes:
            data = decompressor.decompress(fbytes)
            # skip empty results so consumers never see a false end of data
            if data:
                yield data
            fbytes = f.read(chunk_size)
        # emit whatever is still buffered inside the decompressor
        tail = decompressor.flush()
        if tail:
            yield tail



def stream_text_lines(gen):
    """
    Generator wrapper function, `gen` is a bytes generator.
    Yields one line of text at a time.
    """
    buf = b''
    for chunk in gen:
        lines = (buf + chunk).splitlines(keepends=True)
        # hold back the last piece, because it may still be incomplete
        # and waiting for more data from gen
        buf = lines.pop() if lines else b''
        for line in lines:
            yield line.decode()
    # yield the final data
    if buf:
        yield buf.decode()



# Sample usage, using the stream_text_lines generator to stream
# one line of RDF text at a time
bytes_generator = stream_unzipped_bytes('freebase-rdf-latest.gz')
for line in stream_text_lines(bytes_generator):
    # do something with `line` of text
    print(line, end='')
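
Once lines are streaming, each one still has to be split into a triple before anything about companies can be pulled out. The sketch below is one possible way to do that and is not part of the original answer. It assumes the dump is N-Triples text rather than XML, with subject, predicate, object and a closing `.` separated by tabs. The helper names `iter_triples`, `iter_gzip_lines` and `subjects_with_object` are hypothetical. `iter_gzip_lines` uses the standard library's `gzip.open` in text mode, which also streams and may serve as a shorter alternative to the zlib loop.

```python
import gzip


def iter_gzip_lines(path):
    """Stream decoded text lines from a gzip file without loading it whole."""
    with gzip.open(path, 'rt', encoding='utf-8') as f:
        yield from f


def iter_triples(lines):
    """Yield (subject, predicate, object) tuples from an iterable of lines.

    Assumption: each line looks like `subject<TAB>predicate<TAB>object<TAB>.`;
    lines that do not fit this shape are silently skipped.
    """
    for line in lines:
        parts = line.rstrip('\r\n').split('\t')
        if len(parts) == 4 and parts[3] == '.':
            yield parts[0], parts[1], parts[2]


def subjects_with_object(triples, wanted_object):
    """Yield the subject of every triple whose object equals `wanted_object`."""
    for subject, _predicate, obj in triples:
        if obj == wanted_object:
            yield subject


# Hypothetical usage; the type URI is an illustrative assumption, so check
# the dump for the identifiers that actually describe companies:
#
# wanted = '<http://rdf.freebase.com/ns/business.business_operation>'
# lines = iter_gzip_lines('freebase-rdf-latest.gz')
# for subject in subjects_with_object(iter_triples(lines), wanted):
#     print(subject)
```

`iter_triples` accepts any iterable of text lines, so it can be fed from `stream_text_lines` just as well as from `iter_gzip_lines`. Everything stays lazy, so memory use does not grow with the size of the dump.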


2021-04-27