3 Answers

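Contributed 1853 experience points, received 6+ likes
Another approach here is to use array_sort together with Spark's equality operator, which handles arrays like any other type, provided the arrays are sorted: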
from pyspark.sql.functions import lit, array, array_sort, array_intersect
target_ar = ["Physics", "Math"]
search_ar = array_sort(array(*[lit(e) for e in target_ar]))
df.where(array_sort(array_intersect(df["Speciality"], search_ar)) == search_ar) \
.show(10, False)
# +-----------+-----------------------------------+
# |Studentname|Speciality |
# +-----------+-----------------------------------+
# |Alex |[Physics, Math, biology] |
# |Sam |[Economics, History, Math, Physics]|
# +-----------+-----------------------------------+
First, we find the common elements with array_intersect(df["Speciality"], search_ar); then we use == to compare the sorted arrays.
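
The intersect-then-compare logic can be modelled in plain Python to see why sorting both sides matters. This is only a sketch of the idea, not Spark code: `contains_all` is a hypothetical helper, and the third sample row is made up for illustration.

```python
# Plain-Python model of: array_sort(array_intersect(col, search_ar)) == search_ar
# contains_all is a hypothetical helper, not part of PySpark.

def contains_all(speciality, target):
    search = sorted(target)  # plays the role of search_ar
    # array_intersect keeps the elements common to both arrays, without duplicates
    common = [x for x in dict.fromkeys(speciality) if x in search]
    # Sorting puts both sides in the same order, so == acts as a set-style match
    return sorted(common) == search

rows = {
    "Alex": ["Physics", "Math", "biology"],
    "Sam": ["Economics", "History", "Math", "Physics"],
    "Kim": ["Math", "Chemistry"],  # made-up row that should not match
}
matches = [name for name, spec in rows.items()
           if contains_all(spec, ["Physics", "Math"])]
print(matches)  # ['Alex', 'Sam']
```

Without the sort, a row listing the subjects in a different order from the target would compare as unequal even though it contains both.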

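Contributed 1834 experience points, received 8+ likes
Using the higher-order function filter should be the most scalable and efficient way to achieve this (Spark 2.4):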
from pyspark.sql import functions as F
df.withColumn("new", F.size(F.expr("""filter(Speciality, x-> x=='Math' or x== 'Physics')""")))\
.filter("new=2").drop("new").show(truncate=False)
+-----------+-----------------------------------+
|Studentname|Speciality |
+-----------+-----------------------------------+
|Alex |[Physics, Math, biology] |
|Sam |[Economics, History, Math, Physics]|
+-----------+-----------------------------------+
If you want to do this dynamically with an array such as a1, you can use F.array_except and F.array and then filter on size (Spark 2.4):
a1=['Math','Physics']
df.withColumn("array", F.array_except("Speciality",F.array(*(F.lit(x) for x in a1))))\
.filter(f"size(array) = size(Speciality) - {len(a1)}").drop("array").show(truncate=False)
+-----------+-----------------------------------+
|Studentname|Speciality |
+-----------+-----------------------------------+
|Alex |[Physics, Math, biology] |
|Sam |[Economics, History, Math, Physics]|
+-----------+-----------------------------------+
To get the count, you can use .count() instead of .show().
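
Both size-based checks above count elements, so they rely on each student's array having no repeated subjects. The plain-Python sketch below models the two checks to illustrate this. The helper names are hypothetical, the row with a repeat is illustrative, and the Spark behaviour assumed is that `filter` keeps repeated elements while `array_except` returns its result without duplicates.

```python
# Plain-Python models of the two size checks (hypothetical helpers, not PySpark).

def filter_size_check(speciality, targets):
    # filter(Speciality, x -> x in targets) keeps every matching element, repeats included
    kept = [x for x in speciality if x in targets]
    return len(kept) == len(targets)

def except_size_check(speciality, targets):
    # array_except drops the targets and de-duplicates what is left
    rest = list(dict.fromkeys(x for x in speciality if x not in targets))
    return len(rest) == len(speciality) - len(targets)

a1 = ["Math", "Physics"]
clean = ["Physics", "Math", "biology"]
repeated = ["Physics", "Math", "biology", "Physics"]  # illustrative row with a repeat

print(filter_size_check(clean, a1), except_size_check(clean, a1))        # True True
print(filter_size_check(repeated, a1), except_size_check(repeated, a1))  # False False
```

If repeats are possible, applying array_distinct to the column first (also Spark 2.4) is one way to restore the assumption.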

Contributed 1900 experience points, received 5+ likes
Assuming Speciality has no duplicates for a student (e.g., no row like the following):
StudentName Speciality
SomeStudent ['Physics', 'Math', 'Biology', 'Physics']
you can use explode and groupby in pandas.
So, for your problem:
# df is above dataframe
# Lookup subjects
a1 = ['Physics', 'Math']
gdata = df.explode('Speciality').groupby(['Speciality']).size().to_frame('Count')
gdata.loc[a1, 'Count']
# Count
# Speciality
# Physics 3
# Math 2