
Filter an array column in a DataFrame based on a given input array (PySpark)

Python

慕容森 2022-09-13 19:41:11

I have a DataFrame like this:

Studentname  Speciality
Alex         ["Physics","Math","biology"]
Sam          ["Economics","History","Math","Physics"]
Claire       ["Political science,Physics"]

I want to find all students who specialize in [Physics, Math], so the output should have 2 rows: Alex and Sam.

This is what I tried:

from pyspark.sql.functions import array_contains
from pyspark.sql import functions as F

def student_info():
    student_df = spark.read.parquet("s3a://studentdata")
    a1=["Physics","Math"]
    df=student_df
    for a in a1:
        df= student_df.filter(array_contains(student_df.Speciality, a))
        print(df.count())

student_info()

output:
3
2

I would like to know how to filter an array column based on a given array subset.


3 Answers

墨色風雨


Another approach here is to use array_sort together with Spark's equality operator, which handles arrays like any other type, provided the arrays are sorted:

from pyspark.sql.functions import lit, array, array_sort, array_intersect

target_ar = ["Physics", "Math"]
search_ar = array_sort(array(*[lit(e) for e in target_ar]))

df.where(array_sort(array_intersect(df["Speciality"], search_ar)) == search_ar) \
  .show(10, False)

# +-----------+-----------------------------------+
# |Studentname|Speciality                         |
# +-----------+-----------------------------------+
# |Alex       |[Physics, Math, biology]           |
# |Sam        |[Economics, History, Math, Physics]|
# +-----------+-----------------------------------+

First we find the common elements with array_intersect(df["Speciality"], search_ar), then we use == to compare the sorted arrays.

2022-09-13

MMMHUHU


Using the higher-order function filter (Spark 2.4+) should be the most scalable and efficient way to do this:

from pyspark.sql import functions as F

df.withColumn("new", F.size(F.expr("""filter(Speciality, x -> x == 'Math' or x == 'Physics')""")))\
  .filter("new=2").drop("new").show(truncate=False)

+-----------+-----------------------------------+
|Studentname|Speciality                         |
+-----------+-----------------------------------+
|Alex       |[Physics, Math, biology]           |
|Sam        |[Economics, History, Math, Physics]|
+-----------+-----------------------------------+

If you want to do this dynamically using an array like a1, you can use F.array_except and F.array, then filter on size (Spark 2.4):

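a1 = ['Math', 'Physics']

# array_except drops every element of a1 from Speciality; if all of a1 was
# present, exactly len(a1) elements are gone. Assumes Speciality has no
# repeated entries, because array_except also removes duplicates.
df.withColumn("array", F.array_except("Speciality", F.array(*(F.lit(x) for x in a1))))\
  .filter(f"size(array) = size(Speciality) - {len(a1)}").drop("array").show(truncate=False)

+-----------+-----------------------------------+
|Studentname|Speciality                         |
+-----------+-----------------------------------+
|Alex       |[Physics, Math, biology]           |
|Sam        |[Economics, History, Math, Physics]|
+-----------+-----------------------------------+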

To get the count, use .count() instead of .show().

2022-09-13

梵蒂岡之花


Assuming no student has duplicate entries in Speciality (that is, no rows like

StudentName Speciality
SomeStudent ['Physics', 'Math', 'Biology', 'Physics']

), you can use explode and groupby in pandas.

So, for your question:

# df is the DataFrame above, as a pandas DataFrame
# Subjects to look up
a1 = ['Physics', 'Math']

gdata = df.explode('Speciality').groupby(['Speciality']).size().to_frame('Count')
gdata.loc[a1, 'Count']

#             Count
# Speciality
# Physics         3
# Math            2
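The counts above show how many students list each subject, not which students list both. A minimal pandas sketch that selects the matching rows with a set-subset test is given here; the in-memory sample data is an assumption retyped from the question, with Claire's entry written as two separate elements.

```python
import pandas as pd

# Assumption: small in-memory copy of the question's data.
df = pd.DataFrame(
    {
        "Studentname": ["Alex", "Sam", "Claire"],
        "Speciality": [
            ["Physics", "Math", "biology"],
            ["Economics", "History", "Math", "Physics"],
            ["Political science", "Physics"],
        ],
    }
)

a1 = ["Physics", "Math"]
wanted = set(a1)

# True for rows whose Speciality list contains every wanted subject.
mask = df["Speciality"].apply(lambda subjects: wanted.issubset(subjects))

matched = df.loc[mask, "Studentname"].tolist()
print(matched)
```

This runs row by row in Python, so it suits data that already fits in memory; for a Spark DataFrame, the column-based filters in the other answers avoid collecting the data first.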

2022-09-13
