I have two dataframes in PySpark:

df1
+-------+--------+----------------+----------+
|new_lat|new_long|        lat_long|State_name|
+-------+--------+----------------+----------+
|  33.64| -117.63|[33.64,-117.625]|   STATE 1|
|  23.45| -101.54|[23.45,-101.542]|   STATE 2|
+-------+--------+----------------+----------+

df2
+-----+-----+--------------------+----------+------------+
|label|value|            dateTime|       lat|        long|
+-----+-----+--------------------+----------+------------+
|  msg|  437|2019-04-06T05:10:...|33.6436263|-117.6255508|
|  msg|  437|2019-04-06T05:10:...|33.6436263|-117.6255508|
|  msg|  437|2019-04-06T05:10:...| 23.453622|-101.5423864|
|  msg|  437|2019-04-06T05:10:...| 23.453622|-101.5420964|
+-----+-----+--------------------+----------+------------+

I want to join these two tables on matching lat/long values, up to 2 decimal places. The desired output dataframe is:

DF3
+-----+-----+--------------------+----------+------------+-------+
|label|value|            dateTime|       lat|        long|  state|
+-----+-----+--------------------+----------+------------+-------+
|  msg|  437|2019-04-06T05:10:...|33.6436263|-117.6255508|STATE 1|
|  msg|  437|2019-04-06T05:10:...|33.6436263|-117.6255508|STATE 1|
|  msg|  437|2019-04-06T05:10:...| 23.453622|-101.5423864|STATE 2|
|  msg|  437|2019-04-06T05:10:...| 23.453622|-101.5420964|STATE 2|
+-----+-----+--------------------+----------+------------+-------+

Given that df2 has over 100M rows, how can I do this efficiently? I tried

df3 = df1.join(df2, df1.new_lat == df2.lat, 'left')

but I am not sure how to match against df1 to at most two decimal places.
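A minimal sketch of the rounding-key idea in plain Python (no Spark), using the sample rows above: round each coordinate in the large table to 2 decimals and look it up against the small table's keys. In PySpark the same keys can be built with pyspark.sql.functions.round(df2.lat, 2) and round(df2.long, 2) in the join condition, and since df1 is tiny relative to 100M rows, wrapping it in pyspark.sql.functions.broadcast(df1) would avoid shuffling the large side; treat this as an illustration of the matching logic, not the full Spark job.

```python
# Small table (df1): coordinates already truncated to 2 decimals.
df1 = [
    {"new_lat": 33.64, "new_long": -117.63, "State_name": "STATE 1"},
    {"new_lat": 23.45, "new_long": -101.54, "State_name": "STATE 2"},
]
# Large table (df2): full-precision coordinates.
df2 = [
    {"label": "msg", "value": 437, "lat": 33.6436263, "long": -117.6255508},
    {"label": "msg", "value": 437, "lat": 23.453622, "long": -101.5423864},
]

# Build a lookup keyed on the 2-decimal (lat, long) pair.
states = {(r["new_lat"], r["new_long"]): r["State_name"] for r in df1}

# "Join": round each df2 coordinate to 2 decimals and look up the state.
df3 = [
    {**r, "state": states.get((round(r["lat"], 2), round(r["long"], 2)))}
    for r in df2
]
```

The equivalent Spark join condition would compare round(df2.lat, 2) == df1.new_lat and round(df2.long, 2) == df1.new_long; note that rounding and truncation differ at the .xx5 boundary, so pick whichever matches how df1's coordinates were produced.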
Joining two PySpark dataframes on values that match up to a given decimal place
慕的地6264312
2023-02-22 16:01:01