StringIO class does not return the expected result in Python 3

aluckdog 2022-05-11 17:10:51
Code that works in Python 2 fails in Python 3 with:

AttributeError: '_io.StringIO' object has no attribute 'name'

Here is the code:

!pip install warc3-wet

import warc
import requests
from contextlib import closing
from io import StringIO

def get_partial_warc_file(url, num_bytes=1024 * 10):
    with closing(requests.get(url, stream=True)) as r:
        buf = StringIO(r.raw.read(num_bytes).decode('utf-8'))
    return warc.WARCFile(fileobj=buf, compress=True)

urls = {
    'warc': 'https://aws-publicdatasets.s3.amazonaws.com/common-crawl/crawl-data/CC-MAIN-2016-07/segments/1454701145519.33/warc/CC-MAIN-20160205193905-00000-ip-10-236-182-209.ec2.internal.warc.gz',
    'wat':  'https://aws-publicdatasets.s3.amazonaws.com/common-crawl/crawl-data/CC-MAIN-2016-07/segments/1454701145519.33/wat/CC-MAIN-20160205193905-00000-ip-10-236-182-209.ec2.internal.warc.wat.gz',
    'wet':  'https://aws-publicdatasets.s3.amazonaws.com/common-crawl/crawl-data/CC-MAIN-2016-07/segments/1454701145519.33/wet/CC-MAIN-20160205193905-00000-ip-10-236-182-209.ec2.internal.warc.wet.gz'
}

files = {file_type: get_partial_warc_file(url=url) for file_type, url in urls.items()}

Source: https://dmorgan.info/posts/common-crawl-python/

Update: the code below returns the metadata of the records. How can I read the news articles themselves?

aws s3 cp --no-sign-request s3://commoncrawl/crawl-data/CC-NEWS/crawl-data/CC-NEWS/2019/08/CC-NEWS-20190824001636-00982.warc.gz

import warc

var = 0
with warc.open("/tmp/CC-NEWS-20190824001636-00982.warc") as f:
    for record in f:
        if var > 1:
            break
        else:
            print (record.date, record.from_response, record.header, record.ip_address, record.offset, record.payload, record.type, record.url, record.write_to)
        var = var + 1
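
A note on the text/bytes distinction involved here: in Python 3, io.StringIO holds text (str), while gzip-compressed data is binary and belongs in io.BytesIO. The following is a minimal sketch using only the standard library, not the warc package. The function name read_gzip_text and the sample bytes are assumptions made for illustration, and the sample is a complete gzip stream rather than a truncated download.

```python
import gzip
import io


def read_gzip_text(raw_bytes):
    """Decompress gzip bytes held in memory and return the decoded text."""
    # io.BytesIO is the binary counterpart of io.StringIO.
    with gzip.GzipFile(fileobj=io.BytesIO(raw_bytes)) as fh:
        return fh.read().decode("utf-8")


# Hypothetical stand-in for r.raw.read(num_bytes): a tiny gzip-compressed header.
sample = gzip.compress(b"WARC/1.0\r\nWARC-Type: warcinfo\r\n")
text = read_gzip_text(sample)
print(text.splitlines()[0])
```

This does not show where the `name` attribute is looked up. In-memory buffers (io.StringIO and io.BytesIO alike) have no `name` attribute, so code that requires one may still need a real file on disk.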
1 Answer

蝴蝶刀刀

The following code returns the source of the news articles along with the metadata.


# wget https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-07/segments/1454701145519.33/warc/CC-MAIN-20160205193905-00000-ip-10-236-182-209.ec2.internal.warc.gz
# gunzip CC-MAIN-20160205193905-00000-ip-10-236-182-209.ec2.internal.warc.gz
# !pip install warc3-wet

import warc

# The counter starts at -10 and the loop stops once it exceeds 1,
# so the first 12 records are printed.
var = -10

with warc.open("CC-MAIN-20160205193905-00000-ip-10-236-182-209.ec2.internal.warc") as f:
    for record in f:
        if var > 1:
            break
        else:
            # record.payload.read() returns the record body; the other fields are metadata.
            print(record.payload.read(), record.date, record.from_response, record.header, record.ip_address, record.offset, record.payload, record.type, record.url, record.write_to)
        var = var + 1
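
The counter in the loop above only serves to stop after the first 12 records. As a minimal sketch of the same idea, itertools.islice from the standard library can cap the iteration directly. The helper name first_n and the dictionary records below are hypothetical stand-ins so the example runs without a WARC file. With the warc library, the opened file object would presumably be passed in instead.

```python
from itertools import islice


def first_n(records, n):
    """Return an iterator over at most n items from any iterable."""
    return islice(records, n)


# Hypothetical stand-in records; not real WARC records.
records = ({"url": f"http://example.com/{i}"} for i in range(100))
subset = list(first_n(records, 12))
print(len(subset))
```

This avoids the manual counter and the if/else branch, and it works on any iterable without reading the remaining items.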


Answered 2022-05-11