
Aggregating a large XML file into a dictionary with lxml etree in Python takes a long time

Python

三國紛爭 2022-06-22 18:02:54

I am having trouble iterating over a large XML file (~300 MB) and summing its values into a Python dictionary. I quickly realised that it is not lxml etree's iterparse that slows things down, but accessing the dictionary on every iteration. Here is a snippet from my XML file:

    <timestep time="7.00">
        <vehicle id="1" eclass="HBEFA3/PC_G_EU4" CO2="0.00" CO="0.00" HC="0.00" NOx="0.00" PMx="0.00" fuel="0.00" electricity="0.00" noise="54.33" route="!1" type="DEFAULT_VEHTYPE" waiting="0.00" lane="-27444291_0" pos="26.79" speed="4.71" angle="54.94" x="3613.28" y="1567.25"/>
        <vehicle id="2" eclass="HBEFA3/PC_G_EU4" CO2="3860.00" CO="133.73" HC="0.70" NOx="1.69" PMx="0.08" fuel="1.66" electricity="0.00" noise="65.04" route="!2" type="DEFAULT_VEHTYPE" waiting="0.00" lane=":1785290_3_0" pos="5.21" speed="3.48" angle="28.12" x="789.78" y="2467.09"/>
    </timestep>
    <timestep time="8.00">
        <vehicle id="1" eclass="HBEFA3/PC_G_EU4" CO2="0.00" CO="0.00" HC="0.00" NOx="0.00" PMx="0.00" fuel="0.00" electricity="0.00" noise="58.15" route="!1" type="DEFAULT_VEHTYPE" waiting="0.00" lane="-27444291_0" pos="31.50" speed="4.71" angle="54.94" x="3617.14" y="1569.96"/>
        <vehicle id="2" eclass="HBEFA3/PC_G_EU4" CO2="5431.06" CO="135.41" HC="0.75" NOx="2.37" PMx="0.11" fuel="2.33" electricity="0.00" noise="68.01" route="!2" type="DEFAULT_VEHTYPE" waiting="0.00" lane="-412954611_0" pos="1.38" speed="5.70" angle="83.24" x="795.26" y="2467.99"/>
        <vehicle id="3" eclass="HBEFA3/PC_G_EU4" CO2="2624.72" CO="164.78" HC="0.81" NOx="1.20" PMx="0.07" fuel="1.13" electricity="0.00" noise="55.94" route="!3" type="DEFAULT_VEHTYPE" waiting="0.00" lane="22338220_0" pos="5.10" speed="0.00" angle="191.85" x="2315.21" y="2613.18"/>
    </timestep>

The number of vehicles grows with every timestep. There are about 11800 timesteps in the file. Now I want to sum up the values of all vehicles based on their position. The x, y values are provided, and I can convert them to lat, long. My current approach is to iterate over the file with lxml etree iterparse and sum the values using lat,long as the dict key. I am using fast_iter from this article: https://www.ibm.com/developerworks/xml/library/x-hiperfparse/

However, this approach takes about 25 minutes to parse the whole file. I am not sure how to do this differently. I know global variables are bad, but I thought they would make this cleaner? Can you think of anything else? I know the dictionary is the cause: without the aggregate function, fast_iter takes about 25 seconds.
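The question's own code is not reproduced here, so the following is only a minimal sketch of the approach it describes: stream the `vehicle` elements and sum selected attributes into a dictionary keyed by position. It is an illustration under assumptions, not the asker's code: it uses the standard library's `xml.etree.ElementTree.iterparse` as a stand-in for lxml's, keys on the rounded raw (x, y) values instead of converted lat/long, wraps the sample in a made-up `<root>` element, and the name `sum_by_position` is hypothetical.

```python
import io
import xml.etree.ElementTree as ET
from collections import defaultdict

FIELDS = ("CO2", "CO", "NOx", "PMx")

def sum_by_position(source, fields=FIELDS):
    """Stream <vehicle> elements and sum `fields` per rounded (x, y) key."""
    totals = defaultdict(lambda: dict.fromkeys(fields, 0.0))
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "vehicle":
            attrib = elem.attrib  # look the attribute dict up only once
            key = (round(float(attrib["x"])), round(float(attrib["y"])))
            bucket = totals[key]
            for name in fields:
                bucket[name] += float(attrib[name])
        elif elem.tag == "timestep":
            # drop the vehicles of a finished timestep to limit memory use
            elem.clear()
    return dict(totals)

# made-up sample data: two vehicles that fall into the same rounded position
sample = io.BytesIO(b"""<root>
<timestep time="7.00">
  <vehicle id="2" CO2="3860.00" CO="133.73" NOx="1.69" PMx="0.08" x="789.78" y="2467.09"/>
</timestep>
<timestep time="8.00">
  <vehicle id="9" CO2="5431.06" CO="135.41" NOx="2.37" PMx="0.11" x="790.26" y="2467.19"/>
</timestep>
</root>""")

totals = sum_by_position(sample)
print(totals)
```

Both sample vehicles round to the key `(790, 2467)`, so the result holds a single entry with the summed values.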


1 Answer

慕田峪7331174


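Your code is slow for two reasons:

1. You do unnecessary work and use inefficient Python statements. You don't use veh_id but still convert it with int(). You create an empty dictionary only to set 4 keys on it in separate statements. You use separate str() and round() calls together with string concatenation, where string formatting can do all that work in one step. And you reference .attrib repeatedly, so Python has to look up that dictionary attribute for you again and again.

2. The sumolib.net.convertXY2LonLat() implementation is very inefficient when called for each individual (x, y) coordinate; it loads the offset and the pyproj.Proj() object from scratch each time. We can cut out the repeated work here by caching the pyproj.Proj() instance, for example. Or we could avoid using it, or use it just once by processing all coordinates in a single step.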
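To make the string-formatting point concrete, here is a small sketch comparing the two ways of building a key. The function names `key_concat` and `key_format` are hypothetical, and the coordinates are made-up values.

```python
def key_concat(lng, lat):
    # the slower pattern: two round() calls, two str() calls, concatenation
    return str(round(lng, 4)) + "," + str(round(lat, 4))

def key_format(lng, lat):
    # one formatting step; always emits exactly four decimals
    return f"{lng:.4f},{lat:.4f}"

print(key_concat(13.404954, 52.520008))  # 13.405,52.52
print(key_format(13.404954, 52.520008))  # 13.4050,52.5200
```

Note that the two variants produce different key strings, because str() drops trailing zeros, so results keyed one way cannot be looked up with keys built the other way.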

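The first issue can be avoided by removing the unnecessary work, by caching things like the attribute dictionary and using it just once, and by caching repeated global name lookups in function arguments (local names are faster to use); the _... keyword arguments are there purely to avoid looking up globals: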

    from operator import itemgetter

    # `net` (the sumolib network) and `raw_pollution_data` (the result dict)
    # are the globals from the question's code
    _fields = ('CO2', 'CO', 'NOx', 'PMx')

    def aggregate(
        vehicle,
        _fields=_fields,
        _get=itemgetter(*_fields, 'x', 'y'),
        _conv=net.convertXY2LonLat,
    ):
        # convert all the fields we need to floats in one step
        *values, x, y = map(float, _get(vehicle.attrib))
        # convert the coordinates to latitude and longitude
        lng, lat = _conv(x, y)
        # get the aggregation dictionary (start with an empty one if missing)
        data = raw_pollution_data.setdefault(
            f"{lng:.4f},{lat:.4f}",
            dict.fromkeys(_fields, 0.0)
        )
        # and sum the numbers
        for f, v in zip(_fields, values):
            data[f] += v
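The function above can be tried without sumolib by swapping in stand-ins. In this sketch `fake_convert` is a hypothetical placeholder for net.convertXY2LonLat() (it is not a real projection), `SimpleNamespace` stands in for an lxml element, and the two records are vehicle 2 from the question's sample with the unused attributes trimmed.

```python
from operator import itemgetter
from types import SimpleNamespace

_fields = ("CO2", "CO", "NOx", "PMx")
raw_pollution_data = {}

def fake_convert(x, y):
    # hypothetical stand-in for net.convertXY2LonLat(); NOT a real projection
    return x / 1_000_000, y / 1_000_000

def aggregate(vehicle, _fields=_fields,
              _get=itemgetter(*_fields, "x", "y"), _conv=fake_convert):
    # pull all needed attributes in one call and convert them to floats
    *values, x, y = map(float, _get(vehicle.attrib))
    lng, lat = _conv(x, y)
    data = raw_pollution_data.setdefault(
        f"{lng:.4f},{lat:.4f}", dict.fromkeys(_fields, 0.0))
    for f, v in zip(_fields, values):
        data[f] += v

# vehicle 2 at time 7.00 and 8.00, reduced to the attributes we read
records = (
    {"CO2": "3860.00", "CO": "133.73", "NOx": "1.69", "PMx": "0.08",
     "x": "789.78", "y": "2467.09"},
    {"CO2": "5431.06", "CO": "135.41", "NOx": "2.37", "PMx": "0.11",
     "x": "795.26", "y": "2467.99"},
)
for attrib in records:
    aggregate(SimpleNamespace(attrib=attrib))

print(raw_pollution_data)
```

With this stand-in conversion both records land on the key `"0.0008,0.0025"`, so the dictionary ends up with one entry holding the summed values. One thing to be aware of: `setdefault()` evaluates `dict.fromkeys(...)` on every call, even when the key already exists.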

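To address the second issue, we can replace the position lookup with something that at least reuses the Proj() instance; in that case we need to apply the location offset manually: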

    proj = net.getGeoProj()
    offset = net.getLocationOffset()

    adjust = lambda x, y, _dx=offset[0], _dy=offset[1]: (x - _dx, y - _dy)

    def longlat(x, y, _proj=proj, _adjust=adjust):
        return _proj(*_adjust(x, y), inverse=True)

Then use this in the aggregation function by replacing the _conv local name:

    def aggregate(
        vehicle,
        _fields=_fields,
        _get=itemgetter(*_fields, 'x', 'y'),
        _conv=longlat,
    ):
        # function body stays the same
        ...

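This is still going to be slow, because it requires us to convert each (x, y) pair separately.

It depends on the exact projection used, but you could simply quantise the x and y coordinates themselves to do the grouping. You would apply the offset first, then "round" the coordinates by the same amount that converting and rounding would achieve. By projecting (1, 0) and (0, 0) and taking the difference between the resulting x values, we learn the rough conversion rate the projection uses; dividing that by 10,000 then gives us the size of the aggregation area in terms of x and y values:

    (proj(1, 0)[0] - proj(0, 0)[0]) / 10000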

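For a standard UTM projection that gives me about 11.5, so snapping the x and y coordinates down to multiples of that factor should give you roughly the same amount of grouping, without having to do a full coordinate conversion for each timestep data point:

    proj = net.getGeoProj()
    factor = abs(proj(1, 0)[0] - proj(0, 0)[0]) / 10000
    dx, dy = net.getLocationOffset()

    def quantise(v, _f=factor):
        # snap v down to the nearest multiple of the cell size
        return v // _f * _f

    def aggregate(
        vehicle,
        _fields=_fields,
        _get=itemgetter(*_fields, 'x', 'y'),
        _dx=dx, _dy=dy,
        _quant=quantise,
    ):
        *values, x, y = map(float, _get(vehicle.attrib))
        # group by the quantised, offset-corrected coordinates
        key = _quant(x - _dx), _quant(y - _dy)
        data = raw_pollution_data.setdefault(key, dict.fromkeys(_fields, 0.0))
        for f, v in zip(_fields, values):
            data[f] += v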

For the very limited dataset shared in the question, this gives me the same results.

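However, if the projection varies with longitude, this may distort the results at different points on the map. I also don't know exactly how you need to aggregate vehicle coordinates across an area.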
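The quantisation idea can be checked in isolation. The sketch below assumes a cell size of 11.5 (the approximate value quoted above) and a location offset of zero; `cell_index` is a hypothetical helper that returns integer grid indices, which make exact dictionary keys without floating-point noise. The coordinates are those of vehicles 1 and 3 from the question's sample.

```python
CELL = 11.5  # assumed cell size in projected units, taken from the text above

def quantise(v, cell=CELL):
    # snap v down to the nearest multiple of the cell size
    return v // cell * cell

def cell_index(x, y, cell=CELL):
    # integer grid indices: every point in the same cell gets the same key
    return int(x // cell), int(y // cell)

a = cell_index(3613.28, 1567.25)  # vehicle 1 at time 7.00
b = cell_index(3617.14, 1569.96)  # vehicle 1 at time 8.00
c = cell_index(2315.21, 2613.18)  # vehicle 3 at time 8.00

print(a, b, c)            # (314, 136) (314, 136) (201, 227)
print(quantise(3613.28))  # 3611.0
```

The two positions of vehicle 1 share a cell, while vehicle 3 falls into a different one, so a dictionary keyed on these indices would sum the first two together.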

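If you really can only aggregate by areas of 1/10000th of a degree of longitude and latitude, then the conversion from (x, y) pairs to long/lat pairs gets a lot faster if you feed whole numpy arrays to net.convertXY2LonLat() instead. That is because pyproj.Proj() accepts arrays and converts coordinates in bulk, which saves a considerable amount of time: instead of making hundreds of thousands of individual conversion calls, we make just a single call.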

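Rather than handle this with Python dictionaries and float objects, you should really use a Pandas DataFrame here. It can trivially take in the strings from each element's attribute dictionary (an operator.itemgetter() object with all the desired keys gives you those values very quickly) and turn all those string values into floating point numbers as it ingests the data. The values are stored in compact binary form in contiguous memory; 11800 rows of coordinates and data entries won't take much memory here.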

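So, load your data into a DataFrame first, then convert all your (x, y) coordinates from that object in one step, and only then use the Pandas grouping functionality to aggregate the values by area: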

    from lxml import etree
    import pandas as pd
    from operator import itemgetter

    def extract_attributes(context, fields):
        values = itemgetter(*fields)
        for _, elem in context:
            yield values(elem.attrib)
            # free the element and any already-processed siblings
            elem.clear()
            while elem.getprevious() is not None:
                del elem.getparent()[0]
        del context

    def parse_emissions(filename):
        context = etree.iterparse(filename, tag="vehicle")

        # create a dataframe from the XML data in a single call
        coords = ['x', 'y']
        entries = ['CO2', 'CO', 'NOx', 'PMx']
        df = pd.DataFrame(
            extract_attributes(context, coords + entries),
            columns=coords + entries, dtype=float)

        # convert *all coordinates together*, then remove the x, y columns.
        # `net` is the sumolib network object from the question's code;
        # net.convertXY2LonLat() *alters the numpy arrays in-place*,
        # so hand it copies rather than views of the dataframe columns.
        df['lng'], df['lat'] = net.convertXY2LonLat(
            df.x.to_numpy(copy=True), df.y.to_numpy(copy=True))
        df.drop(coords, axis=1, inplace=True)

        # 'group' data by rounding the latitude and longitude,
        # effectively creating areas of 1/10000th degrees per side
        lnglat = ['lng', 'lat']
        df[lnglat] = df[lnglat].round(4)

        # aggregate the results and return the summed dataframe
        return df.groupby(lnglat)[entries].sum()

    emissions = parse_emissions("/path/to/emission_output.xml")
    print(emissions)

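Using Pandas, a sample sumo net definition file and an XML file reconstructed by repeating your 2 sample timestep entries 5900 times, I can parse the whole dataset in about 1 second, total time. However, I suspect that your figure of 11800 timesteps is far too low (that comes to less than 10MB of XML data), so I wrote your sample out 11800 * 20 == 236000 times to a file, and that took 22 seconds to process with Pandas.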
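A test file like the one described can be produced with a few lines of standard-library Python. This is only a sketch of one way to do it, not the script actually used for the timings above: the `<emission-export>` root element name and the `write_repeated` helper are assumptions, and the vehicle attributes are trimmed to the ones the parser reads.

```python
import io

# the two sample timesteps from the question, reduced to the needed attributes
SAMPLE = (
    '<timestep time="7.00">\n'
    '  <vehicle id="1" CO2="0.00" CO="0.00" NOx="0.00" PMx="0.00" x="3613.28" y="1567.25"/>\n'
    '  <vehicle id="2" CO2="3860.00" CO="133.73" NOx="1.69" PMx="0.08" x="789.78" y="2467.09"/>\n'
    '</timestep>\n'
    '<timestep time="8.00">\n'
    '  <vehicle id="1" CO2="0.00" CO="0.00" NOx="0.00" PMx="0.00" x="3617.14" y="1569.96"/>\n'
    '  <vehicle id="2" CO2="5431.06" CO="135.41" NOx="2.37" PMx="0.11" x="795.26" y="2467.99"/>\n'
    '</timestep>\n'
)

def write_repeated(out, repeats, sample=SAMPLE):
    """Write `sample` to the text stream `out` `repeats` times inside one root."""
    out.write("<emission-export>\n")  # assumed root element name
    for _ in range(repeats):
        out.write(sample)
    out.write("</emission-export>\n")

buf = io.StringIO()
write_repeated(buf, 5900)
print(len(buf.getvalue()), "characters written")
```

To write to disk instead, pass a file opened with `open(path, "w", encoding="utf-8")` in place of the `StringIO` buffer.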

You could also look at GeoPandas, which would let you aggregate by geographical areas.

Answered 2022-06-22
