
Fastest way to load data into a pd.DataFrame from different formats (csv, json, avro)

Python

蠱毒傳說 2021-05-07 10:15:22

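We are loading large amounts of data from Google BigQuery into a pandas DataFrame (consumed directly as pandas, and also as an xgbMatrix). BQ can export CSV, JSON and AVRO. Our data contains dates, integers, floats and strings, and is usually "wide" (many columns). Our first approach was to import the data as CSV, but parsing takes a long time:

(32 GB, 126 files, CSV) -> 25 min

Parsing code:

    import pandas as pd
    import psutil
    from google.cloud import bigquery

    def load_table_files_to_pandas(all_files, table_ref):
        # Map BigQuery field types to pandas dtypes. Integers are mapped to
        # float64, which can hold NaN for missing values.
        pd_dtypes = {'string': 'object', 'date': 'object',
                     'float': 'float64', 'integer': 'float64'}
        dict_dtype = {}
        date_cols = []
        client = bigquery.Client()  # create a bq client
        table = client.get_table(table_ref)
        for field in table.schema:
            field_type = field.field_type.lower()
            dict_dtype[field.name] = pd_dtypes[field_type]
            if field_type == 'date':
                date_cols.append(field.name)

        print('start reading data')
        df_from_each_file = []
        for f in all_files:  # looping over files
            df_from_each_file.append(
                pd.read_csv(f, dtype=dict_dtype, parse_dates=date_cols))
            print('memory in use = {}'.format(psutil.virtual_memory().percent))
        df = pd.concat(df_from_each_file, ignore_index=True)
        print('end reading data')
        return df

Which format parses fastest into pandas: Avro, CSV or JSON? Or is there another option we have not considered?

We also tried dask with CSV, reading both directly from storage and from local disk, but the parsing time was almost the same.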
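One way to approach the "which format parses faster" question is to measure it on a sample of your own data. The sketch below is only an illustration, not the asker's code: `time_loader` and the small in-memory `sample` frame are hypothetical stand-ins for one exported file, and only CSV and newline-delimited JSON are covered, because pandas reads both without extra packages. An Avro reader could be timed the same way by passing another loader function.

```python
import io
import time

import pandas as pd


def time_loader(label, loader, repeats=3):
    """Call loader() a few times; return (label, best_seconds, last_frame)."""
    best = float("inf")
    frame = None
    for _ in range(repeats):
        start = time.perf_counter()
        frame = loader()
        best = min(best, time.perf_counter() - start)
    return label, best, frame


# Hypothetical sample with the column kinds from the question:
# a date, an integer, a float and a string column.
sample = pd.DataFrame({
    "day": pd.date_range("2021-01-01", periods=1000, freq="D").strftime("%Y-%m-%d"),
    "count": range(1000),
    "score": [i * 0.5 for i in range(1000)],
    "name": ["row{}".format(i) for i in range(1000)],
})
csv_text = sample.to_csv(index=False)
json_text = sample.to_json(orient="records", lines=True)

results = [
    time_loader(
        "csv",
        lambda: pd.read_csv(io.StringIO(csv_text), parse_dates=["day"]),
    ),
    time_loader(
        "json",
        lambda: pd.read_json(io.StringIO(json_text), lines=True,
                             convert_dates=["day"]),
    ),
]

for label, seconds, frame in results:
    print("{}: {:.4f} s for {} rows".format(label, seconds, len(frame)))
```

Timings on a 1000-row sample say little about 32 GB of wide data, so it is worth rerunning this against one real exported file per format before choosing.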

2 answers

搖曳的薔薇

Contributed 1793 experience points, received more than 6 likes

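When dealing with files this large, I would use Spark with the Parquet format. That way you can scale out both the reading and the computation. pandas was not made for files this large.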

2021-05-11
