
Update every item/line in a 90G JSON-format file (not necessarily with Python)

Python

慕田峪7331174 2022-10-11 21:13:21

I have a 90G file made up of JSON items. Here is a sample with only 3 lines:

{"description":"id1","payload":{"cleared":"2020-01-31T10:23:54Z","first":"2020-01-31T01:29:23Z","timestamp":"2020-01-31T09:50:47Z","last":"2020-01-31T09:50:47Z"}}
{"description":"id2","payload":{"cleared":"2020-01-31T11:01:54Z","first":"2020-01-31T02:45:23Z","timestamp":"2020-01-31T09:50:47Z","last":"2020-01-31T09:50:47Z"}}
{"description":"id3","payload":{"cleared":"2020-01-31T5:33:54Z","first":"2020-01-31T01:29:23Z","timestamp":"2020-01-31T07:50:47Z","last":"2019-01-31T04:50:47Z"}}

The end goal is, for each line, to take the maximum of first, cleared and last, and update timestamp with that maximum. After that, all items should be sorted by timestamp. Ignore the sorting for now.

I initially converted the file into a single JSON file and used the following code:

#!/usr/bin/python
import json as simplejson
from collections import OrderedDict

with open("input.json", "r") as jsonFile:
    data = simplejson.load(jsonFile, object_pairs_hook=OrderedDict)

for x in data:
    maximum = max(x['payload']['first'],x['payload']['cleared'],x['payload']['last'])
    x['payload']['timestamp']= maximum

data_sorted = sorted(data, key = lambda x: x['payload']['timestamp'])

with open("output.json", "w") as write_file:
    simplejson.dump(data_sorted, write_file)

The code above works on a small test file, but when I ran it on the 90G file the script was killed. I then decided to process the file line by line with the following code:

#!/usr/bin/python
import sys
import json as simplejson
from collections import OrderedDict

first_arg = sys.argv[1]
data = []

with open(first_arg, "r") as jsonFile:
    for line in jsonFile:
        y = simplejson.loads(line,object_pairs_hook=OrderedDict)
        payload = y['payload']
        first = payload.get('first', None)
        clearedAt = payload.get('cleared')
        last = payload.get('last')
        lst = [first, clearedAt, last]
        maximum = max((x for x in lst if x is not None))
        y['payload']['timestamp']= maximum
        data.append(y)

with open("jl2json_new.json", "w") as write_file:
    simplejson.dump(data, write_file, indent=4)

It still got killed. So I would like to know the best way to approach this problem. I tried the following, but it did not help: https://stackoverflow.com/a/21709058/322541


2 Answers

慕斯王

Contributed 1864 experience points, received more than 2 likes

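You have to finish all the processing for each line as you go. Right now you parse a line into the variable y and process it, but instead of writing it to the output file you store it in the data list. So you still end up with all the data in memory, and in deserialized form: going from JSON strings to Python objects, it will take hundreds of GB of memory.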
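To get a rough feel for that overhead, the sketch below compares the size of one raw line with an estimate of the size of the parsed object. The helper name deep_size is made up for this illustration, and sys.getsizeof is a CPython-specific approximation (it can double-count shared strings), so treat the numbers as an order-of-magnitude hint rather than a measurement.

```python
import json
import sys


def deep_size(obj):
    """Rough recursive size in bytes of a parsed JSON value (hypothetical helper)."""
    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        # Count both keys and values of every nested dictionary.
        size += sum(deep_size(k) + deep_size(v) for k, v in obj.items())
    elif isinstance(obj, list):
        size += sum(deep_size(v) for v in obj)
    return size


# First sample line from the question.
line = (
    '{"description":"id1","payload":{"cleared":"2020-01-31T10:23:54Z",'
    '"first":"2020-01-31T01:29:23Z","timestamp":"2020-01-31T09:50:47Z",'
    '"last":"2020-01-31T09:50:47Z"}}'
)

raw_bytes = len(line.encode("utf-8"))
parsed_bytes = deep_size(json.loads(line))
print(raw_bytes, parsed_bytes)
```

On CPython the parsed estimate comes out several times larger than the raw line, which is why collecting every parsed item in a list cannot work for a 90G input.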

If your code already works on a small sample, change it to write out each line as it is processed:

#!/usr/bin/python
import sys
import json as simplejson
from collections import OrderedDict

first_arg = sys.argv[1]

# Open input and output together so each item is written as soon as it is processed.
with open(first_arg, "rt") as jsonFile, open("jl2json_new.json", "wt") as write_file:
    for line in jsonFile:
        y = simplejson.loads(line, object_pairs_hook=OrderedDict)
        payload = y['payload']
        first = payload.get('first', None)
        clearedAt = payload.get('cleared')
        last = payload.get('last')
        lst = [first, clearedAt, last]
        # Ignore missing fields when taking the maximum.
        maximum = max((x for x in lst if x is not None))
        y['payload']['timestamp'] = maximum
        # One JSON object per output line; nothing accumulates in memory.
        write_file.write(simplejson.dumps(y) + "\n")
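One caveat about max() on the raw strings: it compares them lexicographically, which matches chronological order only when every field is zero-padded. The sample in the question contains "2020-01-31T5:33:54Z", which is not. A minimal sketch of a per-line helper that parses the values first is shown here; the name latest_timestamp and the assumed format string are illustrative, not part of the original answer.

```python
from datetime import datetime

# Assumed layout of the timestamps, based on the sample lines in the question.
TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def latest_timestamp(payload, keys=("first", "cleared", "last")):
    """Return the chronologically latest value as a zero-padded string, or None."""
    parsed = [
        datetime.strptime(payload[k], TIMESTAMP_FORMAT)
        for k in keys
        if payload.get(k) is not None
    ]
    if not parsed:
        return None
    # Re-format so the result is zero-padded and sorts correctly as a string.
    return max(parsed).strftime(TIMESTAMP_FORMAT)


# Hypothetical payload mixing a padded and an unpadded hour.
example = {"first": "2020-01-31T5:33:54Z", "cleared": "2020-01-31T10:23:54Z"}
print(max(example.values()))      # plain string comparison
print(latest_timestamp(example))  # comparison on parsed datetimes
```

In the loop above, the call maximum = latest_timestamp(payload) could stand in for the max(...) line if the data really does contain unpadded values.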

2022-10-11

隔江千里

Contributed 1906 experience points, received more than 10 likes

The mmap module lets you "pin" memory to a file. This saves you from having to read the whole thing into memory.

import mmap
import json
from collections import OrderedDict

with open("test.json", "r+b") as f:
    # memory-map the file, size 0 means whole file
    mm = mmap.mmap(f.fileno(), 0)
    # read one line at a time through the map instead of calling json.load(f),
    # which would bypass the map and parse the whole file at once
    for line in iter(mm.readline, b""):
        json_dict = json.loads(line, object_pairs_hook=OrderedDict)
        print(json_dict)
    # close the map
    mm.close()

This Stack Overflow question, about reading large JSON data one chunk at a time, may be another thing to try.
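Chunked reading also gives one possible route to the sorting step that the question sets aside. The sketch below is an external merge sort: it sorts fixed-size chunks of lines in memory, writes each chunk to a temporary file, and merges the sorted files with heapq.merge. The function name sort_jsonl and the chunk_lines default are assumptions for illustration. It also assumes one JSON object per line and zero-padded timestamp strings, so that string order matches chronological order.

```python
import heapq
import json
import os
import tempfile
from itertools import islice


def timestamp_of(line):
    """Sort key: the payload timestamp string of one JSON line."""
    return json.loads(line)["payload"]["timestamp"]


def sort_jsonl(src_path, dst_path, chunk_lines=1_000_000):
    """Sort a JSON-lines file by payload timestamp without loading it whole."""
    run_paths = []
    with open(src_path, "rt") as src:
        while True:
            raw = list(islice(src, chunk_lines))
            if not raw:
                break
            # Drop blank lines and make sure every line ends with a newline.
            chunk = [l.rstrip("\n") + "\n" for l in raw if l.strip()]
            if not chunk:
                continue
            chunk.sort(key=timestamp_of)
            fd, path = tempfile.mkstemp(suffix=".jsonl")
            with os.fdopen(fd, "wt") as run:
                run.writelines(chunk)
            run_paths.append(path)

    runs = [open(p, "rt") for p in run_paths]
    try:
        with open(dst_path, "wt") as dst:
            # heapq.merge keeps only one pending line per sorted run in memory.
            dst.writelines(heapq.merge(*runs, key=timestamp_of))
    finally:
        for r in runs:
            r.close()
        for p in run_paths:
            os.remove(p)
```

A call such as sort_jsonl("jl2json_new.json", "sorted.json") would sort the file at that path. With very many chunks the merge opens one file per chunk, so chunk_lines may need to be raised to stay under the operating system's open-file limit.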

2022-10-11
